-
Notifications
You must be signed in to change notification settings - Fork 976
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean work abort #1729
Clean work abort #1729
Conversation
b953507
to
35a1e88
Compare
35a1e88
to
b76e667
Compare
scheduleFatalFailure(); | ||
<< getUniqueName() << " failed, propagating " | ||
<< "abort"; | ||
abort(WORK_COMPLETE_FAILURE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is basically 'abort all children and then retry', I'm not sure if naming is good here
I'm not sure if that is 100% correct implementation. I think that we should be able to, for example, abort But it should be 100% possible to do that, as Also I don't think 'Work::run' should only abort when it is in |
Ok, I think it should be possible. My initial concern was that aborting work in Edit: Also, regarding work being in a weird state, there's extra complexity specifically for RunCommandWork, because if we want to abort it while it's running, that means we need some interaction with ProcessManager, since the process might be either in the process queue or already running. So while it should be possible to abort running work, it will be quite complex. |
src/work/Work.h
Outdated
@@ -78,6 +80,7 @@ class Work : public WorkParent | |||
virtual void onRun(); | |||
virtual void onFailureRetry(); | |||
virtual void onFailureRaise(); | |||
virtual void onAbort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to add documentation on abort semantics
|
||
// If necessary, propagate abort signal before advancing children | ||
// This is to prevent scheduling any children to run if they are about | ||
// to be in WORK_ABORTING state (such children are scheduled to abort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no WORK_ABORTING
state
@@ -141,10 +145,22 @@ Work::callComplete() | |||
}; | |||
} | |||
|
|||
void | |||
Work::scheduleAbort(CompleteResult result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is strange: I would expect scheduleAbort
to just schedule a call to abort
} | ||
else if (anyChildRaiseFailure()) | ||
else if (anyChildAborted()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure I understand: why would a child aborting cause parents to abort as well?
src/work/Work.cpp
Outdated
// This scenario is handled in `run` method, where abort is scheduled | ||
// instead of success. | ||
|
||
assert(getState() == WORK_PENDING); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error handling is wrong: I would expect an exception to be thrown if abort is called at the wrong time.
That said: I am not sure there should ever be a bad time to call abort, if Work is already complete or aborting, it can safely return (no-op)?
src/work/Work.cpp
Outdated
if (allDone) | ||
{ | ||
// Children are ready, schedule abort for work itself. | ||
scheduleAbort(result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it would be better to just scheduleRun
here
@@ -61,6 +61,13 @@ WorkManagerImpl::notify(std::string const& child) | |||
mApp.getMetrics().NewMeter({"work", "root", "failure"}, "unit").Mark(); | |||
mChildren.erase(child); | |||
} | |||
else if (i->second->getState() == Work::WORK_FAILURE_ABORTED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before reviewing this PR, I opened #1755 as I thought semantics were already not super clean and error prone, now that we have abort(ing), we really need to formalize well what is going on, otherwise we're going to run into very strange bugs.
Also, the semantics implied here from onAbort
don't really seem to follow what was described in #1706 (or at least it's unclear that it can work if onAbort
only triggers some work).
I would recommend going back to basics: describe a state machine, its transitions and when certain callbacks (onXYZ
) get called. Right now the mAborting
flag makes it hard to tell which state transitions are valid vs invalid (and what is supposed to happen).
The two ways to abort are:
- somebody wants to abort some work, this causes transitions to
WORK_ABORTING
toWORK_ABORTED
; any work that was in flight is now complete (and was aborted if needed). - abort in preparation for a retry, something like "decision to retry" ->
WORK_ABORTING_FOR_RETRY
(aborting work) ->WORK_PENDING
(reset) ->WORK_RUNNING
(run) ...
@@ -159,6 +159,41 @@ ProcessManagerImpl::shutdown() | |||
} | |||
} | |||
|
|||
void | |||
ProcessManagerImpl::shutdownProcess(std::shared_ptr<ProcessExitEvent> pev) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this implementation of shutdownProcess
is not desirable in the typical case (it's doing kill -9 ...
):
we want "clean" shutdown by default in case processes start sub processes. The process hierarchy is typically something like stellar-core -> bash -> aws_cli-> aws_cli_children
, a "force kill" of bash
may leave the aws_cli
process (and children) running.
You can keep this implementation under a "force" parameter, but the MVP may not need this (ie: the only place where we would need it is if we're implementing timeout for abort)
@@ -27,14 +27,14 @@ class ProcessManagerImpl : public ProcessManager | |||
// Subprocesses will be removed asynchronously, hence the lock on | |||
// just this member | |||
std::recursive_mutex mImplsMutex; | |||
std::map<int, std::shared_ptr<ProcessExitEvent::Impl>> mImpls; | |||
std::map<int, std::shared_ptr<ProcessExitEvent>> mImpls; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename this: it's not mImpls
anymore
@@ -49,11 +49,12 @@ class ProcessManager : public std::enable_shared_from_this<ProcessManager>, | |||
{ | |||
public: | |||
static std::shared_ptr<ProcessManager> create(Application& app); | |||
virtual ProcessExitEvent runProcess(std::string const& cmdLine, | |||
std::string outputFile = "") = 0; | |||
virtual std::shared_ptr<ProcessExitEvent> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we move to using to std::shared_ptr<ProcessExitEvent>
across the board, should we make ProcessExitEvent
non-copyable/movable?
Abort is implemented via the new work interface #1819 |
Graceful abort implementation as described here: #1706
(includes aborting Work in both PENDING and RUNNING states)
Note that after these changes, ProcessManager will need to be updated to abort work first prior to shutting down.