-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#1217: Check for zero objects on a MPI rank/node case after LB runs #1221
Conversation
4f68193
to
4ab18a9
Compare
PR tests (gcc-5, ubuntu, mpich)
|
PR tests (nvidia cuda 10.1, ubuntu, mpich)
|
PR tests (intel 19, ubuntu, mpich)
|
PR tests (intel 18.03, ubuntu, mpich)
|
PR tests (nvidia cuda 11.0, ubuntu, mpich)
|
1 similar comment
PR tests (nvidia cuda 11.0, ubuntu, mpich)
|
PR tests (nvidia cuda 10.1, ubuntu, mpich)
|
PR tests (intel 19, ubuntu, mpich)
|
I'm not sure if I'm checking the right thing, but if we want to only emit a warning - could this work? |
PR tests (intel 18.03, ubuntu, mpich)
|
Codecov Report
@@ Coverage Diff @@
## develop #1221 +/- ##
===========================================
+ Coverage 81.00% 81.01% +0.01%
===========================================
Files 730 730
Lines 28044 28057 +13
===========================================
+ Hits 22716 22731 +15
+ Misses 5328 5326 -2
|
I'm thinking that maybe we should also have an option to disallow this from happening since it might be a non-recoverable error in some cases. Right before asking for the last migration on a node we could stop it if its the last object. @PhilMiller Thoughts? |
A few thoughts, in order of me thinking of them:
|
Anyway, for development/debugging purposes, having a warning that a process is becoming empty which might precede job failure would be useful on request. It could maybe just be a flag value displayed in the post-balancing output from BaseLB, rather than a separate print? |
auto node_load = theNodeStats()->getNodeLoad(); | ||
auto load = node_load->find(thePhase()->getCurrentPhase()); | ||
if (load == node_load->end() or | ||
(load != node_load->end() and load->second.empty())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this test may end up being problematic, if we address #566, since it will reflect load on the process distinct from collection objects.
I think any test needs to be per-collection, since application logic is likely to work or fail depending on presence of a member of one or more particular collections. |
👍 for keeping the logs in one place, having |
If we want to stop the last collection element from migrating, we can add something like this to if (elm_holder->numElements() == 1) {
return MigrateStatus::ElementNotLocal;
} |
4ab18a9
to
d889754
Compare
PR tests (nvidia cuda 10.1, ubuntu, mpich)
|
PR tests (nvidia cuda 11.0, ubuntu, mpich)
|
PR tests (intel 18.03, ubuntu, mpich)
|
PR tests (intel 19, ubuntu, mpich)
|
58caa0a
to
e99cf1d
Compare
note: it seems that per-collection information is already printed in vt/src/vt/vrt/collection/manager.impl.h Lines 2842 to 2846 in 92c2ee9
|
e99cf1d
to
1d02993
Compare
if (elm_holder->numElements() == 0) {
vt_debug_print(
lb, node,
"migrateOut: node has no objects assigned after migration\n\n"
);
} or check this condition from running lb tests without the flag, we hit the "0 elements" case multiple times:
with the flag enabled tests run fine, but no such case is detected:
|
c2d1fa9
to
f7eddfa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. Can you add a sentence in the docs about this option? Maybe the virtual context collection page?
PR tests (nvidia cuda 11.0, ubuntu, mpich)
|
PR tests (nvidia cuda 10.1, ubuntu, mpich)
|
PR tests (intel 18.03, ubuntu, mpich)
|
PR tests (intel 19, ubuntu, mpich)
|
fixes #1217