Clean work abort #1729

marta-lokhova · 2018-07-11T18:11:51Z

Graceful abort implementation as described here: #1706
(includes aborting Work in both PENDING and RUNNING states)
Note that after these changes, ProcessManager will need to be updated to abort work before shutting down.

vogel · 2018-07-25T09:38:34Z

src/work/Work.cpp

-        scheduleFatalFailure();
+                            << getUniqueName() << " failed, propagating "
+                            << "abort";
+        abort(WORK_COMPLETE_FAILURE);


this is basically 'abort all children and then retry'; I'm not sure the naming is good here

vogel · 2018-07-25T10:18:39Z

I'm not sure this implementation is 100% correct.

I think we should be able to abort, for example, ApplyBucketsWork during bucket application. Imagine that we are in the middle of the work and ApplyBucketsWork is in the WORK_RUNNING state. Then, for some reason, its parent work gets aborted. Abort is then not called on ApplyBucketsWork because of its state.

But it should be 100% possible to do that, as ApplyBucketsWork schedules a new success after each good application.

Also, I don't think 'Work::run' should abort only when it is in the PENDING state. It should be able to do that in the RUNNING state too, since being at that point suggests that it is between chunks of work and not in the middle of one.
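A minimal sketch of the "abort between chunks" idea, assuming a simplified work object that is advanced one chunk at a time. `ChunkedWork`, `requestAbort`, `step` and the state names are hypothetical and are not taken from the PR; the point is only that an abort request recorded at any time is honored at the next chunk boundary, never in the middle of a chunk.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a Work that applies its input in chunks.
// None of these names come from the PR.
class ChunkedWork
{
  public:
    enum State
    {
        PENDING,
        RUNNING,
        SUCCESS,
        ABORTED
    };

    explicit ChunkedWork(std::size_t totalChunks) : mTotal(totalChunks)
    {
        assert(totalChunks > 0);
    }

    // Record the request; it takes effect at the next chunk boundary.
    void
    requestAbort()
    {
        mAbortRequested = true;
    }

    // One scheduling step: honor a pending abort first, else apply one chunk.
    void
    step()
    {
        if (mState == SUCCESS || mState == ABORTED)
        {
            return; // terminal states ignore further steps
        }
        if (mAbortRequested)
        {
            mState = ABORTED; // we are between chunks, so this is consistent
            return;
        }
        mState = RUNNING;
        ++mApplied; // stands in for applying one chunk
        if (mApplied == mTotal)
        {
            mState = SUCCESS;
        }
    }

    State
    getState() const
    {
        return mState;
    }

    std::size_t
    chunksApplied() const
    {
        return mApplied;
    }

  private:
    std::size_t const mTotal;
    std::size_t mApplied{0};
    State mState{PENDING};
    bool mAbortRequested{false};
};
```

With three chunks, calling `step()`, then `requestAbort()`, then `step()` again leaves the object in `ABORTED` with exactly one chunk applied.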

marta-lokhova · 2018-07-27T18:37:53Z

Ok, I think it should be possible. My initial concern was that aborting work in the running state might leave us in some weird state. But analyzing it more, I think that if we can guarantee that work is aborted while in a consistent state (e.g. the previous chunk was successful, the next chunk hasn't started, all connections, files, etc. are closed or released), then we should be ok. (onAbort can be used for such checks.)

Edit: Also, regarding work being in a weird state, there's extra complexity specifically for RunCommandWork, because if we want to abort it while it's running, that means we need some interaction with ProcessManager, since the process might be either in the process queue or already running. So while it should be possible to abort running work, it will be quite complex.

MonsieurNicolas · 2018-08-14T20:32:57Z

src/work/Work.h

@@ -78,6 +80,7 @@ class Work : public WorkParent
    virtual void onRun();
    virtual void onFailureRetry();
    virtual void onFailureRaise();
+    virtual void onAbort();


you need to add documentation on abort semantics

MonsieurNicolas · 2018-08-14T21:11:48Z

src/work/Work.cpp

+
+    // If necessary, propagate abort signal before advancing children
+    // This is to prevent scheduling any children to run if they are about
+    // to be in WORK_ABORTING state (such children are scheduled to abort


there is no WORK_ABORTING state

MonsieurNicolas · 2018-08-14T21:15:35Z

src/work/Work.cpp

@@ -141,10 +145,22 @@ Work::callComplete()
    };
 }

+void
+Work::scheduleAbort(CompleteResult result)


this is strange: I would expect scheduleAbort to just schedule a call to abort

MonsieurNicolas · 2018-08-14T21:16:42Z

src/work/Work.cpp

    }
-    else if (anyChildRaiseFailure())
+    else if (anyChildAborted())


not sure I understand: why would a child aborting cause parents to abort as well?

MonsieurNicolas · 2018-08-14T21:18:32Z

src/work/Work.cpp

+    // This scenario is handled in `run` method, where abort is scheduled
+    // instead of success.
+
+    assert(getState() == WORK_PENDING);


error handling is wrong: I would expect an exception to be thrown if abort is called at the wrong time.

That said, I am not sure there should ever be a bad time to call abort: if Work is already complete or aborting, can it not safely return (no-op)?
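A minimal sketch of the no-op behavior suggested above, under the assumption of a simplified state set. `AbortableWork`, `State`, `finishAbort` and the call counter are hypothetical illustrations, not the PR's types. `abort()` is safe to call at any time, while a genuinely misplaced internal call throws instead of asserting.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical, simplified states; the PR's real enum differs.
enum class State
{
    Pending,
    Running,
    Aborting,
    Aborted,
    Success,
    Failure
};

class AbortableWork
{
  public:
    explicit AbortableWork(State initial = State::Pending) : mState(initial)
    {
    }

    // Safe to call at any time: only live work starts aborting.
    void
    abort()
    {
        if (mState != State::Pending && mState != State::Running)
        {
            return; // already complete or already aborting: no-op
        }
        mState = State::Aborting;
        ++mOnAbortCalls; // stands in for invoking the onAbort() callback
        assert(mOnAbortCalls == 1); // internal invariant: fires once
    }

    // Called once the cleanup started by the callback has finished.
    void
    finishAbort()
    {
        if (mState != State::Aborting)
        {
            throw std::logic_error("finishAbort called while not aborting");
        }
        mState = State::Aborted;
    }

    State
    getState() const
    {
        return mState;
    }

    int
    onAbortCalls() const
    {
        return mOnAbortCalls;
    }

  private:
    State mState;
    int mOnAbortCalls{0};
};
```

Calling `abort()` twice on running work triggers the callback stand-in once, and calling it on work that already succeeded leaves the state untouched.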

MonsieurNicolas · 2018-08-14T21:26:09Z

src/work/Work.cpp

+    if (allDone)
+    {
+        // Children are ready, schedule abort for work itself.
+        scheduleAbort(result);


It seems it would be better to just scheduleRun here

MonsieurNicolas · 2018-08-14T22:15:59Z

src/work/WorkManagerImpl.cpp

@@ -61,6 +61,13 @@ WorkManagerImpl::notify(std::string const& child)
        mApp.getMetrics().NewMeter({"work", "root", "failure"}, "unit").Mark();
        mChildren.erase(child);
    }
+    else if (i->second->getState() == Work::WORK_FAILURE_ABORTED)


Before reviewing this PR, I opened #1755 as I thought semantics were already not super clean and error prone, now that we have abort(ing), we really need to formalize well what is going on, otherwise we're going to run into very strange bugs.

Also, the semantics implied here from onAbort don't really seem to follow what was described in #1706 (or at least it's unclear that it can work if onAbort only triggers some work).

I would recommend going back to basics: describe a state machine, its transitions and when certain callbacks (onXYZ) get called. Right now the mAborting flag makes it hard to tell which state transitions are valid vs invalid (and what is supposed to happen).

The two ways to abort are:

Somebody wants to abort some work: this causes a transition to WORK_ABORTING and then to WORK_ABORTED; any work that was in flight is now complete (and was aborted if needed).

Abort in preparation for a retry: something like "decision to retry" -> WORK_ABORTING_FOR_RETRY (aborting work) -> WORK_PENDING (reset) -> WORK_RUNNING (run) ...
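A minimal sketch of what an explicit transition table for these two abort paths could look like. The enumerators mirror the state names used in this comment, but the full set of allowed transitions (for example, which states may enter `Aborting`) is an assumption made for illustration, and `isValidTransition`, `transition` and `demoRetryPath` are hypothetical helpers.

```cpp
#include <cassert>
#include <stdexcept>

// Enumerators follow the names in the comment above; the set is illustrative.
enum class WorkState
{
    Pending,
    Running,
    Aborting,
    AbortingForRetry,
    Aborted,
    Success,
    Failure
};

// Assumed transition table: every edge not listed here is invalid.
inline bool
isValidTransition(WorkState from, WorkState to)
{
    switch (from)
    {
    case WorkState::Pending:
        return to == WorkState::Running || to == WorkState::Aborting;
    case WorkState::Running:
        return to == WorkState::Success || to == WorkState::Failure ||
               to == WorkState::Aborting ||
               to == WorkState::AbortingForRetry;
    case WorkState::Aborting:
        return to == WorkState::Aborted;
    case WorkState::AbortingForRetry:
        return to == WorkState::Pending; // reset before running again
    case WorkState::Aborted:
    case WorkState::Success:
    case WorkState::Failure:
        return false; // terminal states
    }
    return false;
}

// Apply a transition, throwing on anything the table does not allow.
inline void
transition(WorkState& state, WorkState to)
{
    if (!isValidTransition(state, to))
    {
        throw std::logic_error("invalid work state transition");
    }
    state = to;
}

// Walks the retry path: running -> aborting for retry -> pending -> running.
inline void
demoRetryPath()
{
    WorkState s = WorkState::Running;
    transition(s, WorkState::AbortingForRetry);
    transition(s, WorkState::Pending);
    transition(s, WorkState::Running);
    assert(s == WorkState::Running);
}
```

Keeping the table in one function makes invalid transitions fail loudly in a single place, which a separate boolean flag next to the state does not do.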

MonsieurNicolas · 2018-08-14T22:29:53Z

src/process/ProcessManagerImpl.cpp

@@ -159,6 +159,41 @@ ProcessManagerImpl::shutdown()
    }
 }

+void
+ProcessManagerImpl::shutdownProcess(std::shared_ptr<ProcessExitEvent> pev)


this implementation of shutdownProcess is not desirable in the typical case (it's doing kill -9 ...):
we want "clean" shutdown by default in case processes start subprocesses. The process hierarchy is typically something like stellar-core -> bash -> aws_cli -> aws_cli_children; a "force kill" of bash may leave the aws_cli process (and its children) running.

You can keep this implementation under a "force" parameter, but the MVP may not need this (i.e. the only place where we would need it is if we're implementing a timeout for abort)
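A minimal POSIX-only sketch of the "force" parameter idea, assuming that a clean shutdown means sending SIGTERM and a forced one means SIGKILL (the `kill -9` mentioned above). `shutdownSignal` and `shutdownPid` are hypothetical helpers, not part of ProcessManagerImpl, and the PR's actual mechanism may differ.

```cpp
#include <cassert>
#include <signal.h>
#include <sys/types.h>

// Pick the signal for a shutdown request: polite by default, forced on demand.
inline int
shutdownSignal(bool force)
{
    return force ? SIGKILL : SIGTERM;
}

// Send the chosen signal to a single process; returns true if it was sent.
// SIGTERM can be handled by the target, which gives it a chance to stop its
// own children; SIGKILL cannot be handled.
inline bool
shutdownPid(pid_t pid, bool force = false)
{
    assert(pid > 0); // zero and negative values address process groups
    return ::kill(pid, shutdownSignal(force)) == 0;
}
```

A caller would use `shutdownPid(pid)` in the normal case and reserve `shutdownPid(pid, true)` for something like an abort timeout.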

MonsieurNicolas · 2018-08-14T22:35:17Z

src/process/ProcessManagerImpl.h

@@ -27,14 +27,14 @@ class ProcessManagerImpl : public ProcessManager
    // Subprocesses will be removed asynchronously, hence the lock on
    // just this member
    std::recursive_mutex mImplsMutex;
-    std::map<int, std::shared_ptr<ProcessExitEvent::Impl>> mImpls;
+    std::map<int, std::shared_ptr<ProcessExitEvent>> mImpls;


rename this: it's not mImpls anymore

MonsieurNicolas · 2018-08-14T22:37:02Z

src/process/ProcessManager.h

@@ -49,11 +49,12 @@ class ProcessManager : public std::enable_shared_from_this<ProcessManager>,
 {
  public:
    static std::shared_ptr<ProcessManager> create(Application& app);
-    virtual ProcessExitEvent runProcess(std::string const& cmdLine,
-                                        std::string outputFile = "") = 0;
+    virtual std::shared_ptr<ProcessExitEvent>


if we move to using std::shared_ptr<ProcessExitEvent> across the board, should we make ProcessExitEvent non-copyable/movable?
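A minimal sketch of what non-copyable/movable could look like, using a hypothetical stand-in class `ExitEventSketch` rather than the real ProcessExitEvent. With copy and move deleted, the only way to share an instance is through the `std::shared_ptr` that owns it.

```cpp
#include <memory>
#include <type_traits>

// Hypothetical stand-in for ProcessExitEvent; not the real class.
class ExitEventSketch
{
  public:
    ExitEventSketch() = default;

    // Deleting copy and move forces all sharing to go through shared_ptr.
    ExitEventSketch(ExitEventSketch const&) = delete;
    ExitEventSketch& operator=(ExitEventSketch const&) = delete;
    ExitEventSketch(ExitEventSketch&&) = delete;
    ExitEventSketch& operator=(ExitEventSketch&&) = delete;
};

static_assert(!std::is_copy_constructible<ExitEventSketch>::value,
              "copies would detach from the tracked process");
static_assert(!std::is_move_constructible<ExitEventSketch>::value,
              "moves would invalidate pointers held elsewhere");

// Sharing still works: copying the pointer does not copy the event.
inline std::shared_ptr<ExitEventSketch>
makeExitEvent()
{
    return std::make_shared<ExitEventSketch>();
}
```

Accidental copies then fail at compile time instead of silently producing a second event object.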

marta-lokhova · 2018-12-20T16:52:47Z

Abort is implemented via the new work interface #1819

marta-lokhova force-pushed the clean_work_abort branch from b953507 to 35a1e88 July 19, 2018 20:40

marta-lokhova added 2 commits July 19, 2018 14:07

Introduce new Work abort state

ff5e3ec

Added work abort tests

b76e667

marta-lokhova force-pushed the clean_work_abort branch from 35a1e88 to b76e667 July 19, 2018 21:08

vogel reviewed Jul 25, 2018

marta-lokhova mentioned this pull request Jul 31, 2018

Potential race in *Work workflows when retrying/stopping work #1706

Closed

marta-lokhova added 6 commits August 1, 2018 11:09

Abort work in RUNNING state

815da92

Add ability to shutdown individual ProcessExitEvents

a1ee684

Updated RunCommandWork to handle aborts properly

51e5c16

Improve CallCmdWork test class so it's not maintained separately

9fa2102

Added running work abort test

0f909e6

Added process shutdown tests

84e5c0f

MonsieurNicolas reviewed Aug 14, 2018

marta-lokhova closed this Dec 20, 2018

marta-lokhova deleted the clean_work_abort branch April 23, 2020 00:00
