-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
1052 Implement epoch scopes #1053
base: develop
Are you sure you want to change the base?
Conversation
87487f7
to
768a1ea
Compare
Codecov Report
@@ Coverage Diff @@
## develop #1053 +/- ##
===========================================
- Coverage 81.67% 81.17% -0.50%
===========================================
Files 733 732 -1
Lines 27909 28281 +372
===========================================
+ Hits 22795 22958 +163
- Misses 5114 5323 +209
|
src/vt/epoch/epoch_manip_make.cc
Outdated
|
||
EpochType EpochManip::nextSeq(EpochScopeType scope, bool is_collective) { | ||
auto& scope_map = is_collective ? scope_collective_ : scope_rooted_; | ||
if (scope_map.find(scope) == scope_map.end()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we simplify the logic here by having an invariant that all create scopes exist in scope_map
, rather than seemingly inserting them on first use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a race if one rank creates a scope before the others, then a message gets sent which creates a child epoch (think rooted epoch within that scope) from that one. It will call nextSeq
to generate one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please write that up in a comment here, and reference it explicitly in scope creation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there may still be a race in this, but it would be I think bad application design that would let it happen
src/vt/epoch/epoch_manip_make.cc
Outdated
vtAssert(next >= 0, "Scope must be greater than 0"); | ||
vtAssert(next < scope_limit, "Scope must be less than the scope limit"); | ||
vtAssert(not live_scopes_.exists(next), "Scope must not already exist"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should probably be vtAbortIf
instead - the counter is an exhaustible resource, and we have no reason to expect it to only exhaust in debugging builds but not release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've changed one above these to vtAbortIf (which is the out-of-space error), the others are just a logic error. Should those be changed also?
@PhilMiller I think I'm missing some major code to carry to scope down to child epochs created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted in discussion, scopes will need to be inherited by any epoch (rooted or collective) created as the child of an existing epoch, and without an explicit scope used for the creation
768a1ea
to
c8d8df9
Compare
c0a1656
to
40ebed6
Compare
I'm thinking about adding a |
So I've found a major problem in how collective epoch scopes are implement wrt nesting. Here's a fairly simple example program that hangs most of the time. The while loop in the scheduler is needed to perturb the order and increase the chance that a message arrives in between. using TestMsg = vt::Message;
static void nestedHandler(TestMsg* msg);
static void nestedHandler2(TestMsg* msg);
TEST_F(TestEpochScopes, test_epoch_scope_3) {
// Test creating nested collective epochs
auto const this_node = theContext()->getNode();
auto const num_nodes = theContext()->getNumNodes();
// Must have at least 2 nodes to run this test
if (num_nodes < 2) {
return;
}
auto scope = theEpoch()->makeScopeCollective();
for (int i = 0; i < 100; i++) {
auto ep = scope.makeEpochCollective();
theMsg()->pushEpoch(ep);
auto next_node = (this_node + 1) % num_nodes;
auto msg = makeMessage<TestMsg>();
theMsg()->sendMsg<TestMsg, nestedHandler>(next_node, msg);
for (int j = 0; j < 5; j++) {
vt::runScheduler();
}
theMsg()->popEpoch(ep);
theTerm()->finishedEpoch(ep);
}
}
static void nestedHandler(TestMsg* msg) {
auto ep = theTerm()->makeEpochCollective();
theMsg()->pushEpoch(ep);
auto const this_node = theContext()->getNode();
auto const num_nodes = theContext()->getNumNodes();
auto next_node = (this_node + 1) % num_nodes;
auto nmsg = makeMessage<TestMsg>();
theMsg()->sendMsg<TestMsg, nestedHandler2>(next_node, nmsg);
theMsg()->popEpoch(ep);
theTerm()->finishedEpoch(ep);
}
static void nestedHandler2(TestMsg* msg) {} |
88242c2
to
06ec8fe
Compare
Here is an overview of what got changed by this pull request: Clones added
============
- tests/unit/epoch/test_epoch_scopes.cc 2
See the complete overview on Codacy |
4f252a4
to
07c1580
Compare
0d5ca12
to
57aa348
Compare
PR tests (clang-8, alpine, mpich)
|
PR tests (intel 19, ubuntu, mpich) Build for 94bfdd6
|
PR tests (nvidia cuda 11.0, ubuntu, mpich) Build for 94bfdd6
|
PR tests (nvidia cuda 10.1, ubuntu, mpich) Build for 94bfdd6
|
165f8c1
to
94bfdd6
Compare
PR tests (gcc-5, ubuntu, mpich) Build for 94bfdd6
|
PR tests (clang-5.0, ubuntu, mpich) Build for 94bfdd6
|
PR tests (gcc-9, ubuntu, mpich, zoltan) Build for 94bfdd6
|
PR tests (clang-3.9, ubuntu, mpich) Build for 94bfdd6
|
PR tests (gcc-10, ubuntu, openmpi, no LB) Build for 94bfdd6
|
PR tests (gcc-6, ubuntu, mpich) Build for 94bfdd6
|
PR tests (gcc-8, ubuntu, mpich, address sanitizer) Build for 94bfdd6
|
PR tests (gcc-7, ubuntu, mpich, trace runtime, LB) Build for 94bfdd6
|
Fixes #1052
TODO:
Addresses multiple issues: