TridiagSolver (local): "bulkerify" rank1 problem solution #860
Conversation
OK with allocating in the bulk, but the vectors bypass umpire. Is that OK?
It would be good to already embed it.
Not sure what you are talking about.
It is not needed, however if we opt for the row permutation in bulk approach,
Depends on the current merge size. For sure it needs a tuning option, but we should think about which is the best way.
```cpp
if (i == j)
  continue;
```
`if` statements in inner loops should be avoided (performance problem). It might have a negligible effect on the full algorithm, but maybe the two-loops option should be considered. If you want to keep this approach, please consider embedding ll. 500-508 in the `i == j` part (improved code readability).
Force-pushed from 44cc3fe to ccb24b0.
Squashed commits:
- 44cc3fe (Alberto Invernizzi, 2023-05-22): add single-thread blas scope
- cc1aab3 (Alberto Invernizzi, 2023-05-09): minor comment
- ca1ce17 (Alberto Invernizzi, 2023-05-09): make also the last step multi-threaded
- 9cd3039 (Alberto Invernizzi, 2023-05-09): use z for the final result of w; this enables re-using the same workspace for making the last step multi-threaded
- 7cab190 (Alberto Invernizzi, 2023-05-09): parallelize 2nd step instead of running single-threaded
- e1b9df4 (Alberto Invernizzi, 2023-05-09): remove usage of device workspace (now it also works in the GPU branch)
- 0a24262 (Alberto Invernizzi, 2023-05-08): starting point: working implementation with "bulk", but mono-task. The full new code structure is there, but work is not yet split over multiple workers. Moreover, it is CPU-only: for GPU it should just be missing a workspace.
Commits picked up from master:
- 2ba7447 (Mikael Simberg, 2023-05-22): Minor updates to `Pipeline`, `TilePipeline`, and `Tile` (#881): add reset and valid member functions to TilePipeline and Pipeline; add default constructors to Tile and TileData; implement RetiledMatrix::done in terms of TilePipeline::reset
- 1c03180 (Mikael Simberg, 2023-05-22): Add GCC 12 CI configuration (#853)
- b02b7a4 (Raffaele Solcà, 2023-05-22): Revise content of `/misc` (#876)
- aeccbc3 (Mikael Simberg, 2023-05-22): Only ignore build* in root of repository (#874)
- 9fcc5ae (Alberto Invernizzi, 2023-05-22): bug fix: warnings as error for cuda/rocm were not enabled (#879)
Force-pushed from ccb24b0 to 3f317f3.
cscs-ci run
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #860      +/-   ##
==========================================
+ Coverage   93.50%   94.90%    +1.39%
==========================================
  Files         134      121       -13
  Lines        8335     7425      -910
  Branches     1074     1010       -64
==========================================
- Hits         7794     7047      -747
+ Misses        366      224      -142
+ Partials      175      154       -21
```
```cpp
// Note: precautionarily we leave at least 1 thread "free" to do other stuff
const std::size_t max_workers = pika::resource::get_thread_pool("default").get_os_thread_count() - 1;

// 1 <= number of workers < max_workers
return std::max<std::size_t>(1, std::min<std::size_t>(max_workers, nworkers));
```
Maybe a case for https://en.cppreference.com/w/cpp/algorithm/clamp?
`std::min`/`std::max` also need `#include <algorithm>`.
Actually the comment above the `return` is wrong in many ways, but the most important wrong bit is that the number of workers is not ensured to be inside that range. With `max_workers = 0`, which we get when there is a single thread in the default pool, the current implementation does `max(1, min(0, x)) = 1` (given `x >= 0`).

The first trivial `clamp` replacement would look like

```cpp
std::clamp<std::size_t>(nworkers, 1, max_workers);
```

which in that corner case ends up with the following values:

```cpp
std::clamp<std::size_t>(nworkers, 1, 0);  // corner case
```

But from the `clamp` documentation: "The behavior is undefined if the value of lo is greater than hi." So we should manage this corner case differently. The solution might be:

```cpp
// `max_workers` should be at least 1
const std::size_t max_workers =
    std::max<std::size_t>(1, pika::resource::get_thread_pool("default").get_os_thread_count() - 1);
return std::clamp<std::size_t>(nworkers, 1, max_workers);
```

Anyhow, whatever option we take, a discussion probably worth having is whether leaving 1 thread free from the bulk operation is just a safety measure or a strong requirement (in which case we require at least two threads in the default pool).
Good point about the undefined behaviour. In that case the `min`/`max` version that you already had there is possibly the clearest. I'm happy with either version.

Regarding the number of threads, I don't think it should be a strong requirement that the default pool has at least two worker threads. If it is, we should make sure to lift that requirement (since it implies we're doing something blocking).

Independent of `clamp` vs. `min`/`max`, couldn't you leave out at least some of the `std::size_t` template parameters?

```diff
 // Note: precautionarily we leave at least 1 thread "free" to do other stuff
 const std::size_t max_workers = pika::resource::get_thread_pool("default").get_os_thread_count() - 1;

 // 1 <= number of workers < max_workers
-return std::max<std::size_t>(1, std::min<std::size_t>(max_workers, nworkers));
+return std::max(std::size_t(1), std::min(max_workers, nworkers));
```
What about 8eab3d1?
Looks good at least to me.
Co-authored-by: Mikael Simberg <[email protected]>
cscs-ci run
Following #860, this applies the same concepts to the distributed implementation. Main changes:
- reduce micro-tasking
- rank1 solution is fully computed on MC even for the GPU backend
This PR aims at reducing micro-tasking for the rank1 problem solution (local).

TODO:
- Is the `z` vector workspace a good thing? (it allows allocating just one vector workspace for partial results, see next point)
- Apply also row permutations in bulk? (@rasolca)
- Do we want to embed `setUnitDiag`?
- `nthreads` for bulk?