
TridiagSolver (local): "bulkerify" rank1 problem solution #860

Merged: 10 commits merged into master, May 31, 2023

Conversation

albestro
Collaborator

@albestro albestro commented May 9, 2023

This PR aims to reduce micro-tasking in the rank1 problem solution (local).

TODO:

  • tridiagonal eigensolver: uniformize usage of end index #843
  • Is re-use of the z vector workspace a good thing? (it allows allocating just one vector workspace for partial results; see next point)
  • Workspaces for multi-threading in bulk are allocated as vector instead of re-using a Matrix in a workspace. Is that ok?
  • Apply also row-permutations in bulk? (@rasolca)
  • Evaluate if something can be done for set0
  • Do we want to embed setUnitDiag?
  • How many nthreads for bulk?

@albestro albestro self-assigned this May 9, 2023
@rasolca
Collaborator

rasolca commented May 9, 2023

* [ ]  Is re-use of `z` vector workspace a good thing? (it allows to allocate just one vector workspace for partial results, see next point)

z (z0 in #858) is needed only in the deflation step. It is then permuted into ztmp (z1) and not needed again until the parent merge. Since the next z can be assembled only once the child eigenvectors have been computed, overlap is not possible in any case, so it is safe to reuse it.

* [ ]  Workspaces for multi-threading in bulk are allocated as vector instead of re-using a Matrix in a workspace. Is it ok?

Allocating in the bulk is ok, but vectors bypass umpire. Is that ok?

* [ ]  Apply also row-permutations in bulk? (@rasolca)

It would be good to embed it already.

* [ ]  Evaluate if something can be done for set0

Not sure what you are talking about.

* [ ]  Do we want to embed `setUnitDiag`?

It is not needed; however, if we opt for the row-permutation-in-bulk approach, setUnitDiag needs to be rewritten anyway.
Embedding it means avoiding releasing all the tiles of the evec matrix and reacquiring them in a when_all.

* [ ]  How many `nthreads` for bulk?

It depends on the current merge size. It certainly needs a tuning option, but we should think about the best way to expose it.

Comment on lines 513 to 525
if (i == j)
continue;
Collaborator


if statements in inner loops should be avoided (performance problem).
It might have a negligible effect on the full algorithm, but the two-loops option should perhaps be considered.

If you want to keep this approach, please consider embedding ll 500-508 in the i == j branch (improved code readability).

Base automatically changed from rasolca/tridiag_rename to master May 15, 2023 14:55
@albestro albestro force-pushed the alby/rank1-problem branch 2 times, most recently from 44cc3fe to ccb24b0 Compare May 26, 2023 08:47
commit 44cc3fe
Author: Alberto Invernizzi <[email protected]>
Date:   Mon May 22 16:44:39 2023 +0200

    add single-thread blas scope

commit cc1aab3
Author: Alberto Invernizzi <[email protected]>
Date:   Tue May 9 16:23:42 2023 +0200

    minor comment

commit ca1ce17
Author: Alberto Invernizzi <[email protected]>
Date:   Tue May 9 16:18:27 2023 +0200

    make also last step multi-threaded

commit 9cd3039
Author: Alberto Invernizzi <[email protected]>
Date:   Tue May 9 15:56:43 2023 +0200

    use z for the final result of w
    this enables re-using the same workspace to make the last step multi-threaded

commit 7cab190
Author: Alberto Invernizzi <[email protected]>
Date:   Tue May 9 15:41:09 2023 +0200

    parallelize 2nd step instead of running single-thread

commit e1b9df4
Author: Alberto Invernizzi <[email protected]>
Date:   Tue May 9 12:29:53 2023 +0200

    remove usage of device workspace (now it works also on the GPU branch)

commit 0a24262
Author: Alberto Invernizzi <[email protected]>
Date:   Mon May 8 18:51:24 2023 +0200

    starting point: working implementation with "bulk", but mono-task.

    The full new code structure is there, but work is not yet split over multiple workers.
    Moreover, it is CPU-only: for the GPU it should just be missing a workspace.

commit 2ba7447
Author: Mikael Simberg <[email protected]>
Date:   Mon May 22 14:07:01 2023 +0200

    Minor updates to `Pipeline`, `TilePipeline`, and `Tile` (#881)

    - Add reset and valid member functions to TilePipeline
    - Add reset and valid member functions to Pipeline
    - Add default constructors to Tile and TileData
    - Implement RetiledMatrix::done in terms of TilePipeline::reset

commit 1c03180
Author: Mikael Simberg <[email protected]>
Date:   Mon May 22 12:45:49 2023 +0200

    Add GCC 12 CI configuration (#853)

commit b02b7a4
Author: Raffaele Solcà <[email protected]>
Date:   Mon May 22 10:01:06 2023 +0000

    Revise content of `/misc` (#876)

commit aeccbc3
Author: Mikael Simberg <[email protected]>
Date:   Mon May 22 10:43:11 2023 +0200

     Only ignore build* in root of repository (#874)

commit 9fcc5ae
Author: Alberto Invernizzi <[email protected]>
Date:   Mon May 22 10:39:25 2023 +0200

    bug fix: warnings as error for cuda/rocm were not enabled (#879)
@albestro albestro requested review from rasolca and msimberg May 26, 2023 09:50
@albestro albestro marked this pull request as ready for review May 26, 2023 09:50
@albestro
Collaborator Author

cscs-ci run

@codecov-commenter

codecov-commenter commented May 26, 2023

Codecov Report

Merging #860 (c31139b) into master (2ba7447) will increase coverage by 1.39%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #860      +/-   ##
==========================================
+ Coverage   93.50%   94.90%   +1.39%     
==========================================
  Files         134      121      -13     
  Lines        8335     7425     -910     
  Branches     1074     1010      -64     
==========================================
- Hits         7794     7047     -747     
+ Misses        366      224     -142     
+ Partials      175      154      -21     
Impacted Files Coverage Δ
include/dlaf/common/vector.h 100.00% <ø> (ø)
...ude/dlaf/eigensolver/get_red2band_panel_nworkers.h 100.00% <100.00%> (ø)
...lude/dlaf/eigensolver/get_tridiag_rank1_nworkers.h 100.00% <100.00%> (ø)
include/dlaf/eigensolver/tridiag_solver/merge.h 100.00% <100.00%> (ø)
src/init.cpp 75.86% <100.00%> (-3.25%) ⬇️

... and 35 files with indirect coverage changes

Comment on lines 23 to 27
// Note: precautionarily we leave at least 1 thread "free" to do other stuff
const std::size_t max_workers = pika::resource::get_thread_pool("default").get_os_thread_count() - 1;

// 1 <= number of workers < max_workers
return std::max<std::size_t>(1, std::min<std::size_t>(max_workers, nworkers));

Collaborator


std::min/max also need #include <algorithm>.

Collaborator Author


Actually, the comment above the return is wrong in several ways, but the most important issue is that the number of workers is not guaranteed to be inside that range.

With max_workers = 0, which we get when there is a single thread in the default pool, the current implementation yields max(1, min(0, x)) = 1 (given x >= 0).

A first, trivial clamp replacement would look like

std::clamp<std::size_t>(nworkers, 1, max_workers);

which specifically ends up in the corner case with following values

std::clamp<std::size_t>(nworkers, 1, 0); // corner case

But from the clamp documentation:

The behavior is undefined if the value of lo is greater than hi.

so we should handle this corner case differently. A possible solution:

  // `max_workers` should be at least 1
  const std::size_t max_workers =
      std::max<std::size_t>(1, pika::resource::get_thread_pool("default").get_os_thread_count() - 1);
  return std::clamp<std::size_t>(nworkers, 1, max_workers);

Anyhow, whichever option we take, a discussion probably worth having is whether leaving 1 thread free from the bulk operation is a safety precaution or a strong requirement (in which case we would require at least two threads in the default pool).

Collaborator


Good point about the undefined behaviour. In that case the min/max version that you already had there is possibly the clearest. I'm happy with either version.

Regarding number of threads, I don't think it should be a strong requirement that the default pool has at least two worker threads. If it is, we should make sure to lift that requirement (since it implies we're doing something blocking).

Independent of clamp vs. min/max, couldn't you leave out at least some of the std::size_t template parameters:

Suggested change (before):

// Note: precautionarily we leave at least 1 thread "free" to do other stuff
const std::size_t max_workers = pika::resource::get_thread_pool("default").get_os_thread_count() - 1;
// 1 <= number of workers < max_workers
return std::max<std::size_t>(1, std::min<std::size_t>(max_workers, nworkers));

Suggested change (after):

// Note: precautionarily we leave at least 1 thread "free" to do other stuff
const std::size_t max_workers = pika::resource::get_thread_pool("default").get_os_thread_count() - 1;
// 1 <= number of workers < max_workers
return std::max(std::size_t(1), std::min(max_workers, nworkers));

?

Collaborator Author


what about 8eab3d1?

Collaborator


Looks good at least to me.

Resolved review threads:
include/dlaf/eigensolver/tridiag_solver/merge.h (outdated)
include/dlaf/eigensolver/tridiag_solver/merge.h (outdated)
include/dlaf/tune.h
src/init.cpp (outdated)
@albestro
Collaborator Author

cscs-ci run

@rasolca rasolca merged commit 744997e into master May 31, 2023
@rasolca rasolca deleted the alby/rank1-problem branch May 31, 2023 10:42
rasolca pushed a commit that referenced this pull request Jul 7, 2023
Following #860, this applies the same concepts to the distributed implementation.
Main changes:
- reduce micro-tasking
- rank1-solution is fully computed on MC even for GPU backend
4 participants