Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean work abort #1729

Closed
wants to merge 8 commits into from
Closed

Conversation

marta-lokhova
Copy link
Contributor

@marta-lokhova marta-lokhova commented Jul 11, 2018

Graceful abort implementation as described here: #1706
(includes aborting Work in both PENDING and RUNNING states)
Note that after these changes, ProcessManager will need to be updated to abort work first prior to shutting down.

scheduleFatalFailure();
<< getUniqueName() << " failed, propagating "
<< "abort";
abort(WORK_COMPLETE_FAILURE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is basically 'abort all children and then retry', I'm not sure if naming is good here

@vogel
Copy link
Contributor

vogel commented Jul 25, 2018

I'm not sure if that is 100% correct implementation.

I think that we should be able to, for example, abort ApplyBucketsWork during bucket application. Imagine that we are in a middle of the work and ApplyBucketsWork is in WORK_RUNNING state. Then, for some reason, its parent work gets aborted. It does not call abort on ApplyBucketsWork because of the state.

But it should be 100% possible to do that, as ApplyBucketsWork schedules new success after each good application.

Also I don't think 'Work::run' should only abort when it is in PENDING state. It should be able to do that even in RUNNING state, as being in that place suggests, that it is between chunks of work and not in the middle of one.

@marta-lokhova
Copy link
Contributor Author

marta-lokhova commented Jul 27, 2018

Ok, I think it should be possible. My initial concern was that aborting work in running state might leave us in some weird state. But analyzing it more, I think if we can guarantee that work is aborted while in consistent state (e.g. previous chunk is successful, next chunk hasn't started, all connections/file/etc are closed/released), then we should be ok. (onAbort can be used for such checks).

Edit: Also, regarding work being in a weird state, there's extra complexity specifically for RunCommandWork, because if we want to abort it while it's running, that means we need some interaction with ProcessManager, since the process might be either in the process queue or already running. So while it should be possible to abort running work, it will be quite complex.

src/work/Work.h Outdated
@@ -78,6 +80,7 @@ class Work : public WorkParent
virtual void onRun();
virtual void onFailureRetry();
virtual void onFailureRaise();
virtual void onAbort();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to add documentation on abort semantics


// If necessary, propagate abort signal before advancing children
// This is to prevent scheduling any children to run if they are about
// to be in WORK_ABORTING state (such children are scheduled to abort
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no WORK_ABORTING state

@@ -141,10 +145,22 @@ Work::callComplete()
};
}

void
Work::scheduleAbort(CompleteResult result)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is strange: I would expect scheduleAbort to just schedule a call to abort

}
else if (anyChildRaiseFailure())
else if (anyChildAborted())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand: why would a child aborting cause parents to abort as well?

// This scenario is handled in `run` method, where abort is scheduled
// instead of success.

assert(getState() == WORK_PENDING);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error handling is wrong: I would expect an exception to be thrown if abort is called at the wrong time.

That said: I am not sure there should ever be a bad time to call abort, if Work is already complete or aborting, it can safely return (no-op)?

if (allDone)
{
// Children are ready, schedule abort for work itself.
scheduleAbort(result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it would be better to just scheduleRun here

@@ -61,6 +61,13 @@ WorkManagerImpl::notify(std::string const& child)
mApp.getMetrics().NewMeter({"work", "root", "failure"}, "unit").Mark();
mChildren.erase(child);
}
else if (i->second->getState() == Work::WORK_FAILURE_ABORTED)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before reviewing this PR, I opened #1755 as I thought semantics were already not super clean and error prone, now that we have abort(ing), we really need to formalize well what is going on, otherwise we're going to run into very strange bugs.

Also, the semantics implied here from onAbort don't really seem to follow what was described in #1706 (or at least it's unclear that it can work if onAbort only triggers some work).

I would recommend going back to basics: describe a state machine, its transitions and when certain callbacks (onXYZ) get called. Right now the mAborting flag makes it hard to tell which state transitions are valid vs invalid (and what is supposed to happen).

The two ways to abort are:

  • somebody wants to abort some work, this causes transitions to WORK_ABORTING to WORK_ABORTED ; any work that was in flight is now complete (and was aborted if needed).
  • abort in preparation for a retry, something like "decision to retry" -> WORK_ABORTING_FOR_RETRY (aborting work) -> WORK_PENDING (reset) -> WORK_RUNNING (run) ...

@@ -159,6 +159,41 @@ ProcessManagerImpl::shutdown()
}
}

void
ProcessManagerImpl::shutdownProcess(std::shared_ptr<ProcessExitEvent> pev)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this implementation of shutdownProcess is not desirable in the typical case (it's doing kill -9 ...):
we want "clean" shutdown by default in case processes start sub processes. The process hierarchy is typically something like stellar-core -> bash -> aws_cli-> aws_cli_children, a "force kill" of bash may leave the aws_cli process (and children) running.

You can keep this implementation under a "force" parameter, but the MVP may not need this (ie: the only place where we would need it is if we're implementing timeout for abort)

@@ -27,14 +27,14 @@ class ProcessManagerImpl : public ProcessManager
// Subprocesses will be removed asynchronously, hence the lock on
// just this member
std::recursive_mutex mImplsMutex;
std::map<int, std::shared_ptr<ProcessExitEvent::Impl>> mImpls;
std::map<int, std::shared_ptr<ProcessExitEvent>> mImpls;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename this: it's not mImpls anymore

@@ -49,11 +49,12 @@ class ProcessManager : public std::enable_shared_from_this<ProcessManager>,
{
public:
static std::shared_ptr<ProcessManager> create(Application& app);
virtual ProcessExitEvent runProcess(std::string const& cmdLine,
std::string outputFile = "") = 0;
virtual std::shared_ptr<ProcessExitEvent>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we move to using to std::shared_ptr<ProcessExitEvent> across the board, should we make ProcessExitEvent non-copyable/movable?

@marta-lokhova
Copy link
Contributor Author

Abort is implemented via the new work interface #1819

@marta-lokhova marta-lokhova deleted the clean_work_abort branch April 23, 2020 00:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants