-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#410: Dependent Epochs rewritten #2204
base: develop
Are you sure you want to change the base?
Conversation
…s to keep progress
Pipelines resultsPR tests (gcc-8, ubuntu, mpich, address sanitizer) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
PR tests (clang-9, ubuntu, mpich) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
PR tests (clang-11, ubuntu, mpich) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
PR tests (gcc-10, ubuntu, openmpi, no LB) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
PR tests (clang-13, ubuntu, mpich) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
PR tests (clang-12, ubuntu, mpich) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
PR tests (clang-14, ubuntu, mpich) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
PR tests (clang-10, ubuntu, mpich) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
PR tests (gcc-12, ubuntu, mpich) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
PR tests (intel icpc, ubuntu, mpich) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich) Build for 48f7a8c (2023-10-31 19:56:21 UTC)
|
* | ||
* \return the new epoch | ||
*/ | ||
EpochType makeEpochRooted( | ||
UseDS use_ds = UseDS{true}, | ||
ParentEpochCapture parent = ParentEpochCapture{} | ||
ParentEpochCapture parent = ParentEpochCapture{}, | ||
bool is_dep = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Boolean parameters (especially defaulted ones) are a bit of an anti-pattern. Preferable to use named enum class
members instead
NodeType this_node_ = uninitialized_destination; | ||
EpochStackType epoch_stack_; | ||
// released epoch list for dependent epochs | ||
std::unordered_set<EpochType> epoch_released_ = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be a range container as is used for other stuff like this, so that it won't tend to grow without bound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was planning this improvement. The bounds are lower than one might think because they are always removed once the epoch terminates.
vt::theTerm()->addAction([=]{ | ||
EXPECT_EQ(TestDep::num_non_dep, 1); | ||
EXPECT_EQ(TestDep::num_dep, (num_nodes - 1)*k); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of the style choice of having the scheduler run and the test assertions get checked as a result of running to quiescence as the process is finishing up. If nothing else, it makes for really funky looking stack traces if/when an assertion fails.
src/vt/scheduler/scheduler.cc
Outdated
@@ -384,6 +384,17 @@ void Scheduler::resume(ThreadIDType tid) { | |||
suspended_.resumeRunnable(tid); | |||
} | |||
|
|||
void Scheduler::releaseEpoch(EpochType ep) { | |||
auto iter = pending_work_.find(ep); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possible use of pending_work_.extract(ep)
src/vt/scheduler/scheduler.impl.h
Outdated
ep != no_epoch and ep != term::any_epoch_sentinel and | ||
not theTerm()->epochReleased(ep)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't epochReleased()
already check for the no_epoch
and any_epoch_sentinel
values?
if (node_ == theContext()->getNode()) { | ||
theSched()->releaseEpochObjgroup(epoch, proxy_); | ||
} else { | ||
theMsg()->send<releaseRemoteObjGroup<ObjT>>(vt::Node(node_), *this, epoch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it going to be valid to release an epoch for a single node?
if (auto iter2 = iter->second.find(untyped); iter2 != iter->second.end()) { | ||
auto& container = iter2->second; | ||
while (container.size() > 0) { | ||
work_queue_.emplace(container.pop()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better to move
the values from container
, rather than actually removing them, so that no work needs to be done on the structure of container
itself, which is going away right after this loop.
I.e.
for (auto &unit : container) {
work_queue_.emplace(std::move(unit));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with this is that container
is not iterable. The container is a hybrid queue implemented differently depending on whether priorities are enabled. If they are not, it is a 64 entry circular buffer with an overflow std::queue
container. We could make it iterable, but it would need a custom iterator.
} | ||
} | ||
} else if (not theTerm()->epochReleased(ep)) { | ||
pending_work_[ep].push(unit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nesting is getting really deep here. It might be nice to peel off some of the cases as early returns, if/as possible.
So, the one big question I have in reviewing this is: why the move to releasing epochs per-collection/objgroup/object? Everything else I've marked in comments is pretty minor, and can/should be addressed after writing up an answer for the big question. Also, squashing fixup commits into their progenitors would be good. Or, even rewriting the history entirely to provide a clearer to understand narrative sequence of commits. |
It looks like creating an epoch within a dependent epoch circumvents waiting for release on all of the messages within the nested epoch. As is, I don't think we could make any assertions about the behavior of general code like this:
|
vt::theCollective()->barrier(); | ||
|
||
auto epoch = vt::theTerm()->makeEpochCollective(term::ParentEpochCapture{}, true); | ||
vt::theMsg()->pushEpoch(epoch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since dependent epochs make this pattern necessary, a returning equivalent of vt::runInEpoch<Collective/Rooted>
would be a nice helper for users.
src/vt/termination/termination.cc
Outdated
theSched()->releaseEpoch(epoch); | ||
} | ||
|
||
void TerminationDetector::onReleaseEpoch(EpochType epoch, ActionType action) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use the same approach as term actions after #2196
// Terminated epochs are always released | ||
bool const is_term = theEpoch()->getTerminatedWindow(epoch)->isTerminated( | ||
epoch | ||
); | ||
if (is_term) { | ||
return true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be true, or an empty dependent epoch could terminate before being released, which seems counter-intuitive.
auto epoch = vt::theTerm()->makeEpochCollective(term::ParentEpochCapture{}, true);
vt::theTerm()->finishedEpoch(epoch);
vt::theTerm()->addAction(epoch, {...});
Your added action could run before releasing in the above, or in cases where a set of functions may or may not actually send messages. We should produce once on dependent epoch creation and consume on release, to align the no-messages behavior with typical behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lifflander What are your thoughts on Matthew's concern?
The short answer to this question is that it will allow us to express dependencies per-object. An object can release work when it's ready, which is a per-object state.
Yes, I am planning on re-writing the history. |
@PhilMiller @Matthew-Whitlock Please take a look at the follow-on PR #2206. I nearly have the new abstraction working on top of dependent epochs called vt/examples/collection/jacobi1d_vt.cc Lines 233 to 284 in 99a63d9
|
Ok, I'll take a look. That maybe answers my follow-up question of what it means to release an epoch for some object/collection, but not for others. |
@Matthew-Whitlock I have the 2-D Jacobi correctly running now with asynchronous iterations. It runs faster (even with the dependency resolution) than the synchronized version. vt/examples/collection/jacobi2d_vt.cc Lines 359 to 412 in eb972fb
|
@@ -89,6 +89,24 @@ struct EpochManip : runtime::component::Component<EpochManip> { | |||
*/ | |||
static bool isRooted(EpochType const& epoch); | |||
|
|||
/** | |||
* \brief Gets whether an epoch is DS or onot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo.
static bool isDS(EpochType epoch); | ||
|
||
/** | ||
* \brief Gets whether an epoch is dependent or onot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo.
/** | ||
* \brief Release an epoch to run | ||
* | ||
* \param[in] ep the epoch to release | ||
*/ | ||
void releaseEpoch(EpochType ep); | ||
|
||
/** | ||
* \brief Release an epoch to run | ||
* | ||
* \param[in] ep the epoch to release | ||
*/ | ||
void releaseEpochObjgroup(EpochType ep, ObjGroupProxyType proxy); | ||
|
||
/** | ||
* \brief Release an epoch to run | ||
* | ||
* \param[in] ep the epoch to release | ||
*/ | ||
void releaseEpochCollection(EpochType ep, UntypedCollection* untyped); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Doxygen here needs to be improved.
auto obj = r->getObj(); | ||
if (ep != no_epoch and ep != term::any_epoch_sentinel) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Phil's comment about peeling off cases. If you're not going to do that, you might at least reverse the order of these two lines.
* | ||
* \return whether it is released | ||
*/ | ||
bool isReleasedEpoch(EpochType epoch) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to call this isEpochReleased()
to have naming consistent with theTerm()->isEpochReleased()
.
|
||
if (is_dep) { | ||
// Put the epoch in the released set. The epoch any_epoch_sentinel does not | ||
// count as a succcessor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo.
@@ -814,16 +863,20 @@ struct TerminationDetector : | |||
// hang detector termination state | |||
TermStateType hang_; | |||
private: | |||
using ActionListType = std::list<ActionType>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was ActionListType
defined here?
// Terminated epochs are always released | ||
bool const is_term = theEpoch()->getTerminatedWindow(epoch)->isTerminated( | ||
epoch | ||
); | ||
if (is_term) { | ||
return true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lifflander What are your thoughts on Matthew's concern?
Fixes #410