TridiagSolver: fix missing sort in the deflation #960

Merged
merged 9 commits into from
Aug 30, 2023


albestro
Collaborator

@albestro albestro commented Aug 21, 2023

(probable fix for #953)

The deflation process can change the order of the deflated eigenvalues, and the step that keeps them sorted was not implemented in our version. As a result, deflated eigenvalues were not always sorted correctly, which led to wrong results or NaN values. (See dlaed2 for more details.)
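To illustrate the kind of step that was missing, here is a minimal sketch, not the code merged in this PR. It builds an index permutation that puts non-deflated entries first and then orders the deflated tail by eigenvalue. The helper name `partitionAndSortDeflated` and the boolean mask are assumptions made for the example, and sorting after the partition may differ from how dlaed2 or DLA-Future organise this step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not DLA-Future's API): returns a permutation listing
// non-deflated entries first, in their original relative order, followed by
// the deflated entries ordered by ascending eigenvalue.
std::vector<std::size_t> partitionAndSortDeflated(const std::vector<double>& evals,
                                                  const std::vector<bool>& deflated) {
  assert(evals.size() == deflated.size());

  std::vector<std::size_t> perm(evals.size());
  std::iota(perm.begin(), perm.end(), std::size_t{0});

  // Step 1: stable partition, non-deflated indices go first.
  auto first_deflated = std::stable_partition(
      perm.begin(), perm.end(), [&](std::size_t i) { return !deflated[i]; });

  // Step 2: order the deflated tail by eigenvalue. Without it the tail keeps
  // whatever order the partition left behind.
  std::stable_sort(first_deflated, perm.end(),
                   [&](std::size_t a, std::size_t b) { return evals[a] < evals[b]; });

  return perm;
}
```

With eigenvalues `{0.5, 3.0, 1.0, 2.0, 0.1}` and entries 1, 3 and 4 marked as deflated, the partition alone gives `{0, 2, 1, 3, 4}`; with the sort the deflated tail becomes `4, 3, 1`.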

Thanks to @RMeli and @rasolca for the investigation and the support in fixing this.

In addition to the bug fix, this PR also introduces std::hypot to avoid possible numerical errors in the deflation step (the C++ standard library function was preferred over LAPACK's dlapy2, for no strong reason).
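As a small illustration of why `std::hypot` helps, the sketch below compares it with the naive `sqrt(a*a + b*b)` when building a plane rotation from two vector components. The struct, the function names and the sign convention are assumptions for the example and are not taken from merge.h.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical type for illustration: rotation coefficients and the norm r.
struct GivensRotation {
  double c;
  double s;
  double r;
};

// Naive norm: the squares can overflow or underflow even when the result
// itself is representable.
double naiveNorm(double a, double b) {
  return std::sqrt(a * a + b * b);
}

// Rotation that folds z_i into z_j, using std::hypot for the norm.
// The sign convention here is one common choice, assumed for the sketch.
GivensRotation makeRotation(double z_i, double z_j) {
  const double r = std::hypot(z_i, z_j);
  assert(r > 0.0);
  return {z_j / r, -z_i / r, r};
}
```

For inputs around 1e200 the naive norm overflows to infinity, while `std::hypot` returns a finite value and the coefficients still satisfy c² + s² ≈ 1.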

Another change is the removal of unused GPU kernels related to this step (namely stablePartitionIndexOnDevice and related).

TODO:

  • Add some note/doc about the change
  • Improve test for stablePartitionIndexForDeflation
  • Open issue about improving tests for tridiag solver or at least add the check in miniapp_tridiag_solver
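For the last TODO item, a check along these lines might be enough to catch the symptoms described above (unsorted or NaN eigenvalues). This is only a sketch: the function names are hypothetical, and how the eigenvalues are gathered from the miniapp into a `std::vector` is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical check: eigenvalues must all be finite and in ascending order.
// Finiteness is tested first because comparisons involving NaN make the
// result of std::is_sorted meaningless.
bool eigenvaluesLookValid(const std::vector<double>& evals) {
  const bool all_finite = std::all_of(evals.begin(), evals.end(),
                                      [](double v) { return std::isfinite(v); });
  return all_finite && std::is_sorted(evals.begin(), evals.end());
}

// Possible use at the end of a driver: stop in debug builds on bad output.
void checkSolverOutput(const std::vector<double>& evals) {
  assert(eigenvaluesLookValid(evals) && "eigenvalues must be finite and sorted");
}
```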

@albestro albestro added this to the release v0.2.0 milestone Aug 21, 2023
@albestro albestro requested review from rasolca and RMeli August 21, 2023 16:07
@albestro albestro self-assigned this Aug 21, 2023
@albestro albestro added Type:Bug Priority:High labels Aug 21, 2023
@albestro
Collaborator Author

cscs-ci run

@albestro
Collaborator Author

cscs-ci run

@albestro albestro changed the title from "TridiagSolver: fix missing sort in the deflation bug" to "TridiagSolver: fix missing sort in the deflation" Aug 21, 2023
@codecov-commenter

codecov-commenter commented Aug 21, 2023

Codecov Report

Merging #960 (eedf028) into master (6084329) will increase coverage by 1.47%.
Report is 1 commits behind head on master.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #960      +/-   ##
==========================================
+ Coverage   93.35%   94.83%   +1.47%     
==========================================
  Files         143      129      -14     
  Lines        8605     7795     -810     
  Branches     1103     1049      -54     
==========================================
- Hits         8033     7392     -641     
+ Misses        388      238     -150     
+ Partials      184      165      -19     
Files Changed                                       Coverage Δ
include/dlaf/eigensolver/tridiag_solver/kernels.h   100.00% <ø> (ø)
include/dlaf/eigensolver/tridiag_solver/merge.h     99.81% <100.00%> (+<0.01%) ⬆️

... and 35 files with indirect coverage changes

@albestro albestro force-pushed the fix-tridiag-solver branch 3 times, most recently from c781736 to 1012688 August 25, 2023 06:52
@albestro albestro marked this pull request as ready for review August 28, 2023 12:22
@rasolca
Collaborator

rasolca commented Aug 30, 2023

cscs-ci run

@albestro
Collaborator Author

TL;DR

  • 👍 Using spack we were able to build CP2K with DLAF support using @RMeli's branch (thanks @mathieu for the valuable support!)
  • 🕵️‍♂️ We were able to collect information about input files to use for testing with CP2K (thanks @rasolca)
  • 🖥️ We tested only on PizDaint-MC
  • 🎉 We used H2O-128 as the test configuration, and all the runs with DLAF as backend reported the same total energy at all intermediate steps (up to ~1e-13), but ...
  • 🤔 ... the energy values we obtained differ from what @RMeli reported in the issue, so we are still not fully sure we used the right input configuration for CP2K.

Build CP2K

  • Use Rocco's CP2K branch
  • Modify the CP2K spack package to enable the DLAF backend
  • ⚠️ intel-oneapi-mkl has a problem with dla-future, so we switched to intel-mkl
  • 🪛 Intel MKL provides FFTW, but cmake does not find it. Adding cray-fftw as a dependency clashes with MKL because fftw-api then has two providers. We manually changed the intel-mkl spack package so that it no longer provides fftw-api (by commenting out the provides directive in it)
  • The dla-future version used is 202308/dev + PR#946 (band2trid + fixes(tag))

Test convergence H2O-128

InputFile

ialberto@daint103:~/workspace/cp2k> git --no-pager diff H2O-128.inp
diff --git a/H2O-128.inp b/H2O-128.inp
index 53bf03706..6b2bb0761 100644
--- a/H2O-128.inp
+++ b/H2O-128.inp
@@ -8,23 +8,21 @@
       REL_CUTOFF 30
     &END MGRID
     &QS
-      EPS_DEFAULT 1.0E-12
+      # EPS_DEFAULT 1.0E-12
       WF_INTERPOLATION PS
       EXTRAPOLATION_ORDER 3
     &END QS
     &SCF
       SCF_GUESS ATOMIC
-      &OT ON
-        MINIMIZER DIIS
-      &END OT
+      &DIAGONALIZATION ON
+        ALGORITHM STANDARD
+      &END DIAGONALIZATION
     # SCF_GUESS        RESTART
     # EPS_SCF      1.0E-7
-
       &PRINT
         &RESTART OFF
         &END
       &END
-
     &END SCF
     &XC
       &XC_FUNCTIONAL Pade
@@ -434,14 +432,12 @@
 &END FORCE_EVAL
 &GLOBAL
   PROJECT H2O-128
-  RUN_TYPE MD
+  RUN_TYPE ENERGY
   PRINT_LEVEL LOW
+  # PREFERRED_DIAG_LIBRARY scalapack
+  PREFERRED_DIAG_LIBRARY dlaf
+  &FM
+    NCOL_BLOCKS 512
+    NROW_BLOCKS 512
+  &END FM
 &END GLOBAL
-&MOTION
-  &MD
-    ENSEMBLE NVE
-    STEPS 10
-    TIMESTEP 0.5
-    TEMPERATURE 300.0
-  &END MD
-&END MOTION

Scalapack vs DLAF @ PizDaint-MC

All runs converged and all steps reported the same total energy (up to ~1e-13).

Scalapack-192

OMP_NUM_THREADS=4 srun -u -o"h2o-128-scalapack.out" -n9 -c8 cp2k.psmp /project/csstaff/ialberto/workspace/cp2k/H2O-128-scalapack.inp
  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   29.2     1.18531859     -2188.1728046123 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   18.6     0.67180775     -2193.8258043430 -5.65E+00
     3 P_Mix/Diag. 0.40E+00   18.5     0.40258064     -2197.2271742605 -3.40E+00
     4 P_Mix/Diag. 0.40E+00   18.4     0.24022917     -2199.2362551350 -2.01E+00
     5 P_Mix/Diag. 0.40E+00   18.4     0.14394355     -2200.4323299412 -1.20E+00
     6 P_Mix/Diag. 0.40E+00   18.4     0.08615989     -2201.1473157800 -7.15E-01
     7 DIIS/Diag.  0.64E-03   18.4     0.05174404     -2201.5755625833 -4.28E-01
     8 DIIS/Diag.  0.12E-03   18.4     0.00040741     -2202.2172419816 -6.42E-01
     9 DIIS/Diag.  0.19E-03   18.5     0.00024484     -2202.2172427207 -7.39E-07
    10 DIIS/Diag.  0.14E-03   18.5     0.00010363     -2202.2172428734 -1.53E-07
    11 DIIS/Diag.  0.26E-04   18.4     0.00002311     -2202.2172429911 -1.18E-07
    12 DIIS/Diag.  0.16E-04   18.4     0.00001440     -2202.2172429954 -4.31E-09
    13 DIIS/Diag.  0.44E-05   18.4     0.00000294     -2202.2172429961 -7.28E-10

  *** SCF run converged in    13 steps ***


  Electronic density on regular grids:      -1023.9999953423        0.0000046577
  Core density on regular grids:             1023.9999999611       -0.0000000389
  Total charge density on r-space grids:        0.0000046188
  Total charge density g-space grids:           0.0000046188

  Overlap energy of the core charge distribution:               0.00001125202266
  Self energy of the core charge distribution:              -5610.60998987709900
  Core Hamiltonian energy:                                   1652.37418036097529
  Hartree energy:                                            2289.23206962205450
  Exchange-correlation energy:                               -533.21351435405154

  Total energy:                                             -2202.21724299609741

 ENERGY| Total FORCE_EVAL ( QS ) energy [a.u.]:            -2202.217242996097411

DLAF256 (RPN=9)

OMP_NUM_THREADS=4 srun -u -o"h2o-128-dlaf256.out" -n9 -c8 cp2k.psmp /project/csstaff/ialberto/workspace/cp2k/H2O-128-dlaf.inp
  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   27.9     1.18531859     -2188.1728046123 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   24.5     0.67180775     -2193.8258043430 -5.65E+00
     3 P_Mix/Diag. 0.40E+00   23.0     0.40258064     -2197.2271742605 -3.40E+00
     4 P_Mix/Diag. 0.40E+00   22.6     0.24022917     -2199.2362551350 -2.01E+00
     5 P_Mix/Diag. 0.40E+00   23.8     0.14394355     -2200.4323299412 -1.20E+00
     6 P_Mix/Diag. 0.40E+00   22.8     0.08615989     -2201.1473157800 -7.15E-01
     7 DIIS/Diag.  0.64E-03   22.3     0.05174404     -2201.5755625833 -4.28E-01
     8 DIIS/Diag.  0.12E-03   21.7     0.00040741     -2202.2172419816 -6.42E-01
     9 DIIS/Diag.  0.19E-03   21.2     0.00024484     -2202.2172427207 -7.39E-07
    10 DIIS/Diag.  0.14E-03   24.1     0.00010363     -2202.2172428734 -1.53E-07
    11 DIIS/Diag.  0.26E-04   21.2     0.00002311     -2202.2172429911 -1.18E-07
    12 DIIS/Diag.  0.16E-04   20.6     0.00001440     -2202.2172429954 -4.31E-09
    13 DIIS/Diag.  0.44E-05   23.8     0.00000294     -2202.2172429961 -7.28E-10

  *** SCF run converged in    13 steps ***


  Electronic density on regular grids:      -1023.9999953423        0.0000046577
  Core density on regular grids:             1023.9999999611       -0.0000000389
  Total charge density on r-space grids:        0.0000046188
  Total charge density g-space grids:           0.0000046188

  Overlap energy of the core charge distribution:               0.00001125202266
  Self energy of the core charge distribution:              -5610.60998987709900
  Core Hamiltonian energy:                                   1652.37418036097552
  Hartree energy:                                            2289.23206962205450
  Exchange-correlation energy:                               -533.21351435405143

  Total energy:                                             -2202.21724299609741

 ENERGY| Total FORCE_EVAL ( QS ) energy [a.u.]:            -2202.217242996097411

DLAF1024 (RPN=9)

OMP_NUM_THREADS=4 srun -u -o"h2o-128-dlaf1024.out" -n9 -c8 cp2k.psmp /project/csstaff/ialberto/workspace/cp2k/H2O-128-dlaf.inp
  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   48.4     1.18531859     -2188.1728046123 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   47.1     0.67180775     -2193.8258043430 -5.65E+00
     3 P_Mix/Diag. 0.40E+00   46.2     0.40258064     -2197.2271742605 -3.40E+00
     4 P_Mix/Diag. 0.40E+00   46.7     0.24022917     -2199.2362551350 -2.01E+00
     5 P_Mix/Diag. 0.40E+00   46.8     0.14394355     -2200.4323299412 -1.20E+00
     6 P_Mix/Diag. 0.40E+00   46.3     0.08615989     -2201.1473157800 -7.15E-01
     7 DIIS/Diag.  0.64E-03   42.7     0.05174404     -2201.5755625833 -4.28E-01
     8 DIIS/Diag.  0.12E-03   46.4     0.00040741     -2202.2172419816 -6.42E-01
     9 DIIS/Diag.  0.19E-03   45.3     0.00024484     -2202.2172427207 -7.39E-07
    10 DIIS/Diag.  0.14E-03   40.3     0.00010363     -2202.2172428734 -1.53E-07
    11 DIIS/Diag.  0.26E-04   44.8     0.00002311     -2202.2172429911 -1.18E-07
    12 DIIS/Diag.  0.16E-04   46.0     0.00001440     -2202.2172429954 -4.31E-09
    13 DIIS/Diag.  0.44E-05   43.3     0.00000294     -2202.2172429961 -7.27E-10

  *** SCF run converged in    13 steps ***


  Electronic density on regular grids:      -1023.9999953423        0.0000046577
  Core density on regular grids:             1023.9999999611       -0.0000000389
  Total charge density on r-space grids:        0.0000046188
  Total charge density g-space grids:           0.0000046188

  Overlap energy of the core charge distribution:               0.00001125202266
  Self energy of the core charge distribution:              -5610.60998987709900
  Core Hamiltonian energy:                                   1652.37418036097552
  Hartree energy:                                            2289.23206962205450
  Exchange-correlation energy:                               -533.21351435405143

  Total energy:                                             -2202.21724299609741

 ENERGY| Total FORCE_EVAL ( QS ) energy [a.u.]:            -2202.217242996097411

DLAF512 (RPN=2)

OMP_NUM_THREADS=18 srun -u -o"h2o-128-dlaf512-rpn2.out" -n2 -c36 cp2k.psmp /project/csstaff/ialberto/workspace/cp2k/H2O-128-dlaf.inp
  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   12.6     1.18531859     -2188.1728046123 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   15.2     0.67180775     -2193.8258043430 -5.65E+00
     3 P_Mix/Diag. 0.40E+00   14.9     0.40258064     -2197.2271742605 -3.40E+00
     4 P_Mix/Diag. 0.40E+00   14.9     0.24022917     -2199.2362551350 -2.01E+00
     5 P_Mix/Diag. 0.40E+00   15.0     0.14394355     -2200.4323299412 -1.20E+00
     6 P_Mix/Diag. 0.40E+00   15.0     0.08615989     -2201.1473157800 -7.15E-01
     7 DIIS/Diag.  0.64E-03   15.2     0.05174404     -2201.5755625833 -4.28E-01
     8 DIIS/Diag.  0.12E-03   15.1     0.00040741     -2202.2172419816 -6.42E-01
     9 DIIS/Diag.  0.19E-03   15.1     0.00024484     -2202.2172427207 -7.39E-07
    10 DIIS/Diag.  0.14E-03   15.1     0.00010363     -2202.2172428734 -1.53E-07
    11 DIIS/Diag.  0.26E-04   15.1     0.00002311     -2202.2172429911 -1.18E-07
    12 DIIS/Diag.  0.16E-04   15.1     0.00001440     -2202.2172429954 -4.30E-09
    13 DIIS/Diag.  0.44E-05   15.1     0.00000294     -2202.2172429961 -7.29E-10

  *** SCF run converged in    13 steps ***


  Electronic density on regular grids:      -1023.9999953423        0.0000046577
  Core density on regular grids:             1023.9999999611       -0.0000000389
  Total charge density on r-space grids:        0.0000046188
  Total charge density g-space grids:           0.0000046188

  Overlap energy of the core charge distribution:               0.00001125202266
  Self energy of the core charge distribution:              -5610.60998987709900
  Core Hamiltonian energy:                                   1652.37418036097779
  Hartree energy:                                            2289.23206962205495
  Exchange-correlation energy:                               -533.21351435405131

  Total energy:                                             -2202.21724299609468

 ENERGY| Total FORCE_EVAL ( QS ) energy [a.u.]:            -2202.217242996094683
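
The final energies in the logs above can be compared with a small helper like the following. It is a sketch under assumptions: the function and constant names are made up for the example, and the two constants are copied from the `ENERGY|` lines (the ScaLAPACK, DLAF256 and DLAF1024 runs print the same value, while the DLAF512 RPN=2 run differs in the last digits).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Final total energies [a.u.] copied from the logs above.
constexpr double kEnergyNineRanks = -2202.217242996097411;   // ScaLAPACK, DLAF256, DLAF1024
constexpr double kEnergyDlaf512Rpn2 = -2202.217242996094683; // DLAF512 (RPN=2)

// Relative difference between two values, scaled by the larger magnitude.
double relativeDifference(double a, double b) {
  const double scale = std::max(std::abs(a), std::abs(b));
  assert(scale > 0.0);
  return std::abs(a - b) / scale;
}
```

For these two values the absolute difference is roughly 2.7e-12 a.u., so `relativeDifference` returns a value well below 1e-13.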

@rasolca rasolca merged commit 7f96b89 into master Aug 30, 2023
3 checks passed
@rasolca rasolca deleted the fix-tridiag-solver branch August 30, 2023 13:25