-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#2201: implement memory aware temperedlb in vt rebased (new version) #2278
#2201: implement memory aware temperedlb in vt rebased (new version) #2278
Conversation
9e057ad
to
6728145
Compare
Pipelines resultsPR tests (clang-10, ubuntu, mpich) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (clang-9, ubuntu, mpich) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (clang-11, ubuntu, mpich) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (intel icpx, ubuntu, mpich, verbose) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (clang-12, ubuntu, mpich) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (clang-13, ubuntu, mpich) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (clang-14, ubuntu, mpich, verbose) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (clang-13, alpine, mpich) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (intel icpc, ubuntu, mpich) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (gcc-10, ubuntu, openmpi, no LB) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (gcc-12, ubuntu, mpich, verbose) Build for 6411e2d (2024-06-13 12:54:36 UTC)
PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (gcc-8, ubuntu, mpich, address sanitizer) Build for a83c66a (2024-09-17 13:37:54 UTC)
PR tests (gcc-12, ubuntu, mpich, verbose, kokkos) Build for a83c66a (2024-09-17 13:37:54 UTC)
|
We need to make running with |
e7581a5
to
25a307d
Compare
3bc773d
to
4339a83
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completed the first review focusing on CMF verification.
Cf. in-code comments @lifflander @nlslatt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that at least one error was found; and at least one other comment deserves further discussion.
Other comments are optional.
0350f16
to
5f8e6a4
Compare
I have applied all the fixes discussed yesterday:
CI seems fine, this still needs a rebase, but I'm holding off until the reviews are finished. |
@lifflander Please have a look at @ppebay comments and resolve them as needed. Maybe we need to create follow up issues based on some of them? |
5f8e6a4
to
ad2ebbb
Compare
…dd option for SwapClusters without memory constraints
@lifflander Not sure why they are broken, I suspect that something went wrong during rebasing. For the record, Git command that worked well for this: |
@cz4rs It's ok for me if you force push the fixed commits here |
8186aa0
to
894ea64
Compare
Signatures are fixed.
|
Following resolution of ccm-milp #13, I suggest we add this test case to the CI suite, @cwschilly: With the In contrast, using the cluster-swapping strategy (and therefore without subclustering), ), we should find |
@ppebay Did you mean beta non-zero? |
#if vt_check_enabled(trace_enabled) | ||
theTrace()->disableTracing(); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might not need to do this anymore now that #2188 has been merged. In the interests of expediency, let's explore that in a follow-on issue.
#if vt_check_enabled(trace_enabled) | ||
theTrace()->enableTracing(); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might not need to do this anymore now that #2188 has been merged. In the interests of expediency, let's explore that in a follow-on issue.
@ppebay Can you express the solution in terms of maximum load or load imbalance? Caleb does not have access to the calculation of W_max in CI. |
thus More details:
whereby we see that cluster 2 is split between ranks 2 and 3.
thus More details:
where indeed no sub-clustering has occurred. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the tests pass this time, I think it's ready
@ppebay @cwschilly For the case that is still failing, we should check if there is a second solution with the same Wmax but that has a worse imbalance. Perhaps we should try adding |
@ppebay Not sure if you were notified of this because I mentioned you incorrectly the first time. |
@nlslatt based on latest discussions, we are going to drop this test (with non-zero |
Fixes #2201
To-do before merging:
CMFTypeEnum::NormBySelf
from the code (both definitions and use) following discussion @nlslatt - @ppebayNOMERGE
commits from history