
TridiagSolver (local): "bulkerify" rank1 problem solution #860

Merged: 10 commits merged into master, May 31, 2023

Conversation

albestro
Collaborator

@albestro albestro commented May 9, 2023

This PR aims to reduce micro-tasking in the rank1 problem solution (local).

TODO:

  • tridiagonal eigensolver: uniformize usage of end index #843
  • Is re-use of the z vector workspace a good thing? (it allows allocating just one vector workspace for partial results; see next point)
  • Workspaces for multi-threading in bulk are allocated as vector instead of re-using a Matrix in a workspace. Is that ok?
  • Apply also row-permutations in bulk? (@rasolca)
  • Evaluate if something can be done for set0
  • Do we want to embed setUnitDiag?
  • How many nthreads for bulk?

@albestro albestro self-assigned this May 9, 2023
@rasolca
Collaborator

rasolca commented May 9, 2023

* [ ]  Is re-use of `z` vector workspace a good thing? (it allows to allocate just one vector workspace for partial results, see next point)

z (z0 in #858) is needed only in the deflation step. It is then permuted into ztmp (z1) and not needed again until the parent merge. Since the next z can be assembled only once the child eigenvectors have been computed, overlap is not possible in any case, so it is safe to reuse it.

* [ ]  Workspaces for multi-threading in bulk are allocated as vector instead of re-using a Matrix in a workspace. Is it ok?

Allocating in the bulk is ok, but vectors bypass umpire. Is that ok?

* [ ]  Apply also row-permutations in bulk? (@rasolca)

It would be good to embed it already.

* [ ]  Evaluate if something can be done for set0

Not sure what you are talking about.

* [ ]  Do we want to embed `setUnitDiag`?

It is not needed; however, if we opt for the row-permutation-in-bulk approach, setUnitDiag needs to be rewritten anyway.
Embedding it means avoiding releasing all the tiles of the evec matrix and reacquiring them in a when_all.

* [ ]  How many `nthreads` for bulk?

It depends on the current merge size. It certainly needs a tuning option, but we should think about the best way to expose it.

Comment on lines 513 to 525
if (i == j)
continue;
Collaborator


if statements in inner loops should be avoided (performance problem).
It might have a negligible effect on the full algorithm, but the two-loops option should perhaps be considered.

If you want to keep this approach, please consider embedding ll 500-508 in the i == j branch (improved code readability).

Base automatically changed from rasolca/tridiag_rename to master May 15, 2023 14:55
@albestro albestro force-pushed the alby/rank1-problem branch 2 times, most recently from 44cc3fe to ccb24b0 Compare May 26, 2023 08:47
commit 44cc3fe
Author: Alberto Invernizzi <[email protected]>
Date:   Mon May 22 16:44:39 2023 +0200

    add single-thread blas scope

commit cc1aab3
Author: Alberto Invernizzi <[email protected]>
Date:   Tue May 9 16:23:42 2023 +0200

    minor comment

commit ca1ce17
Author: Alberto Invernizzi <[email protected]>
Date:   Tue May 9 16:18:27 2023 +0200

    make also last step multi-threaded

commit 9cd3039
Author: Alberto Invernizzi <[email protected]>
Date:   Tue May 9 15:56:43 2023 +0200

    use z for the final result of w
    this enables re-using the same workspace to make the last step multi-threaded

commit 7cab190
Author: Alberto Invernizzi <[email protected]>
Date:   Tue May 9 15:41:09 2023 +0200

    parallelize 2nd step instead of running single-thread

commit e1b9df4
Author: Alberto Invernizzi <[email protected]>
Date:   Tue May 9 12:29:53 2023 +0200

    remove usage of device workspace (now it works also on the GPU branch)

commit 0a24262
Author: Alberto Invernizzi <[email protected]>
Date:   Mon May 8 18:51:24 2023 +0200

    starting point: working implementation with "bulk", but mono-task.

    The full new code structure is there, but work is not yet split over multiple workers.
    Moreover, it is CPU-only: for the GPU it should just be missing a workspace.

commit 2ba7447
Author: Mikael Simberg <[email protected]>
Date:   Mon May 22 14:07:01 2023 +0200

    Minor updates to `Pipeline`, `TilePipeline`, and `Tile` (#881)

    - Add reset and valid member functions to TilePipeline
    - Add reset and valid member functions to Pipeline
    - Add default constructors to Tile and TileData
    - Implement RetiledMatrix::done in terms of TilePipeline::reset

commit 1c03180
Author: Mikael Simberg <[email protected]>
Date:   Mon May 22 12:45:49 2023 +0200

    Add GCC 12 CI configuration (#853)

commit b02b7a4
Author: Raffaele Solcà <[email protected]>
Date:   Mon May 22 10:01:06 2023 +0000

    Revise content of `/misc` (#876)

commit aeccbc3
Author: Mikael Simberg <[email protected]>
Date:   Mon May 22 10:43:11 2023 +0200

     Only ignore build* in root of repository (#874)

commit 9fcc5ae
Author: Alberto Invernizzi <[email protected]>
Date:   Mon May 22 10:39:25 2023 +0200

    bug fix: warnings as error for cuda/rocm were not enabled (#879)
@albestro albestro requested review from rasolca and msimberg May 26, 2023 09:50
@albestro albestro marked this pull request as ready for review May 26, 2023 09:50
@albestro
Collaborator Author

cscs-ci run

@codecov-commenter

codecov-commenter commented May 26, 2023

Codecov Report

Merging #860 (c31139b) into master (2ba7447) will increase coverage by 1.39%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #860      +/-   ##
==========================================
+ Coverage   93.50%   94.90%   +1.39%     
==========================================
  Files         134      121      -13     
  Lines        8335     7425     -910     
  Branches     1074     1010      -64     
==========================================
- Hits         7794     7047     -747     
+ Misses        366      224     -142     
+ Partials      175      154      -21     
Impacted Files Coverage Δ
include/dlaf/common/vector.h 100.00% <ø> (ø)
...ude/dlaf/eigensolver/get_red2band_panel_nworkers.h 100.00% <100.00%> (ø)
...lude/dlaf/eigensolver/get_tridiag_rank1_nworkers.h 100.00% <100.00%> (ø)
include/dlaf/eigensolver/tridiag_solver/merge.h 100.00% <100.00%> (ø)
src/init.cpp 75.86% <100.00%> (-3.25%) ⬇️

... and 35 files with indirect coverage changes

Comment on lines 23 to 27
// Note: precautionarily we leave at least 1 thread "free" to do other stuff
const std::size_t max_workers = pika::resource::get_thread_pool("default").get_os_thread_count() - 1;

// 1 <= number of workers < max_workers
return std::max<std::size_t>(1, std::min<std::size_t>(max_workers, nworkers));

Collaborator


std::min/max also need #include <algorithm>.

Collaborator Author


Actually, the comment above the return is wrong in several ways, but the most important issue is that the number of workers is not guaranteed to be inside that range.

With max_workers = 0, which we get when there is a single thread in the default pool, the current implementation yields max(1, min(0, x)) = 1 (given x >= 0).

A first, trivial clamp replacement would look like

std::clamp<std::size_t>(nworkers, 1, max_workers);

which specifically ends up in the corner case with following values

std::clamp<std::size_t>(nworkers, 1, 0); // corner case

But from the clamp documentation:

The behavior is undefined if the value of lo is greater than hi.

so we should handle this corner case differently. A possible solution:

  // `max_workers` should be at least 1
  const std::size_t max_workers =
      std::max<std::size_t>(1, pika::resource::get_thread_pool("default").get_os_thread_count() - 1);
  return std::clamp<std::size_t>(nworkers, 1, max_workers);

Anyhow, whichever option we take, a discussion probably worth having is whether leaving 1 thread free from the bulk operation is a safety precaution or a strong requirement (in which case we would require at least two threads in the default pool).

Collaborator


Good point about the undefined behaviour. In that case the min/max version that you already had there is possibly the clearest. I'm happy with either version.

Regarding number of threads, I don't think it should be a strong requirement that the default pool has at least two worker threads. If it is, we should make sure to lift that requirement (since it implies we're doing something blocking).

Independent of clamp vs. min/max, couldn't you leave out at least some of the std::size_t template parameters:

Suggested change (before):

// Note: precautionarily we leave at least 1 thread "free" to do other stuff
const std::size_t max_workers = pika::resource::get_thread_pool("default").get_os_thread_count() - 1;
// 1 <= number of workers < max_workers
return std::max<std::size_t>(1, std::min<std::size_t>(max_workers, nworkers));

Suggested change (after):

// Note: precautionarily we leave at least 1 thread "free" to do other stuff
const std::size_t max_workers = pika::resource::get_thread_pool("default").get_os_thread_count() - 1;
// 1 <= number of workers < max_workers
return std::max(std::size_t(1), std::min(max_workers, nworkers));

?

Collaborator Author


what about 8eab3d1?

Collaborator


Looks good at least to me.

Resolved review threads:
include/dlaf/eigensolver/tridiag_solver/merge.h (outdated)
include/dlaf/eigensolver/tridiag_solver/merge.h (outdated)
include/dlaf/tune.h
src/init.cpp (outdated)
@albestro
Collaborator Author

cscs-ci run

@rasolca rasolca merged commit 744997e into master May 31, 2023
@rasolca rasolca deleted the alby/rank1-problem branch May 31, 2023 10:42
rasolca pushed a commit that referenced this pull request Jul 7, 2023
Following #860, this applies the same concepts to the distributed implementation.
Main changes:
- reduce micro-tasking
- rank1-solution is fully computed on MC even for GPU backend
4 participants