-
Notifications
You must be signed in to change notification settings - Fork 976
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
design cleanup: Work interface is a mess #1755
Comments
I think that some of the implications of this are to ensure that |
I did a bit of investigation of this looking for patterns and how the code it trying to use Work, so here are some thoughts:
While 1 and 2 look to me like direct responsibilities of
|
Multi-part workLet's use What this does:
The problem I see here is that it's actually abusing the way State definitions and callbacksWe need to clearly define what each state is supposed to be. Per my comment for a simple work, I think that the only difference between An example on how we can force people to implement things properly: Work with childrenI think that we can make working with children a lot cleaner with some changes to the state definitions. I think that we could make everything work fine by just using So, instead, I think there should be a way for ExecutionStatus MyWork::onRun() {
// helper function returns:
// DONE: all children done, no errors
// RUNNING: some children are still running
// FAILURE_*: some children failed
auto status = checkAllChildrenDone();
if (status == DONE)
{
// do something else
}
else
{
return status;
}
} With this, I think Work vs WorkParentI agree the separation of concern is not great on this one. I am not sure what the original intent of
|
Work FSM
This makes me wonder: why do we even allow work to put itself back into PENDING state (on the graph, that would be
yes, that is my understanding as well. Though
Are you assuming here the guarantee we discussed before that core won't shutdown until all work is in completed state? Can we also define "important transitions" that you are talking about? Transitioning into a completion state and retrying might be good candidates. Not sure if calling Let's try to define clear guidelines on which callbacks should be used for what: Children management & scheduling
I don't know if I agree on this one. This sounds to me like:
Work & WorkParent
|
Work FSMI agree that it may make sense to ditch the I also think that nobody should be calling
That's a good point: I think the only reason one needs to "restart itself" is to just wait for some outside event (and for that, a callback with for example a timer should be used), any other restart should be managed as a failure (that kicks off a restart).
Mmm, yeah I think Also, if we do this, then Children management & scheduling
no, the executor does not care about what should be run: it always calls "run". The simplest way to think about this is that we first class support for waiting for a callback (triggered by a timer for example). As we first class callbacks, we can provide a wrapper that is the only way to invoke a
btw, I noticed something strange in the code base: mApp.getWorkManager().advanceChildren(); That seems fishy to me: why would a child have to "kick" the work manager? It's probably a work around for a hack of sorts... Work & WorkParent
Yes, maybe with some points of clarification:
If someone needs to add child work, they can simply derive from the With all this: is there a reason to have the |
Ok, from the offline discussion today, looks like I needed to make some changes to the design doc I've been working on that I'll post here -- it's almost done. Meanwhile, some clarifications:
I think this comes from a fact that when work is added, nothing kicks it off, it's just kind of sitting there until |
High-levelGoals
What is provided by the work interface
What is required from the implementers
InterfaceDisclaimer: we need better names for these WorkFSM & Work (child of WorkFSM)
Here are some descriptions of the methods we'd use:
Below are methods for scheduling running and completion of work
virtual void onSuccess()/onFailureRaise()/onFailureFatal()
Additionally, as mentioned above, we introduce a new state
WorkManager : public Work (potentially WorkScheduler?)
|
In general, I would prefer that you first describe interfaces without any implementation details (ie public methods only). Then, you can use those interfaces in
namingI think they should be: interfaces
|
Okay, will update based on the feedback.
Ok, I think it should be more clear once I separate BasicWork and Work
Like I mentioned earlier,
What I meant is that when we transition into important states like retry or failure, child iterator would be reset.
Yes, essentially child in WAITING state gets skipped.
I was talking about |
BasicWorkpublic:
enum State
{
WORK_RUNNING,
WORK_SUCCESS,
WORK_WAITING,
WORK_FAILURE_RETRY,
WORK_FAILURE_RAISE,
WORK_FAILURE_FATAL
};
void reset(); // reset work to its initial state
private:
// only BasicWork deals with state transitions; it also controls IO service
void setState(State newState);
void run(); // see tentative implementation below
void complete(State state); // set appropriate completion state and notify scheduler
void scheduleRun(); // post callback to `run` to IO service
void scheduleComplete(); // post callback to `complete` to IO service
// ...Retries helper functions would also go here; they are very similar to what we have now... //
protected:
State mState{RUNNING};
// Note that these callbacks are protected
// I don't think there's a good reason for letting outside classes access them
virtual void onReset();
virtual void onFailureRetry();
virtual void onFailureRaise();
virtual void onSuccess();
// Implementers provide work logic used in `onRun`, that hints BasicWork what state to transition to.
virtual State doActualWork() = 0;
virtual State onRun( return doActualWork(); );
std::function<void(asio::error_code ec)> mWaitingCallback {...};
// whenever implementers decide to return WAITING state, they need to make sure
// to define an event that re-trigger running (call `wakeUp`). For example, in case of a timer,
// implementers can use a generic waiting callback defined above.
void wakeUp();
void crank(); // schedule proper action based on state (`run`, `retry`, `abort` etc...)
// at this level BasicWork would just crank itself, since there is no notion of work
// manager/children yet, but this kind of weird so we'll let `Work` implement it.
virtual void notifyScheduler() = 0;
void BasicWork::run()
{
auto nextState = onRun();
If (nextState == DONE)
{
scheduleComplete(nextState); // complete will notify work manager
}
else
{
setState(nextState);
notifyScheduler(); // still running or waiting, tell scheduler to crank again
}
}
Work : public BasicWorkpublic:
template <typename T, typename... Args>
std::shared_ptr<T>
addWork(Args&&... args) { ...same as current code... };
protected:
std::vector<std::shared_ptr<Work>> mChildren;
std::weak_ptr<Work> mScheduler; // main scheduler that schedules cranks
// remove completed children, additionally implementers can add new children if needed (e.g. BatchWork)
virtual void manageChildren();
virtual std::shared_ptr<Work> yieldNextChild(); // if child is returned, call crank on it, otherwise crank self
State onRun() override final; // do slightly more in `onRun`, as now we manage children
State doActualWork() override; // may additionally check if children are done
void notifyScheduler() override
{
// get shared_ptr for mScheduler
mScheduler->crank();
}
// ..various helper methods to check state of children.. // With this, State onRun()
{
manageChildren(); // Add more work if needed
Auto child = yieldNextChild();
If (child == nullptr) // run self
{
return doActualWork();
}
else
{
if (child->getState != WAITING)
{
child->crank();
}
return RUNNING;
}
} WorkScheduler : public WorkI think we all the functionality described above (how WS gets notified and children removed), there isn't much for ExamplesWith this, some of the example works would look tentatively like this:
So, implementers have to:
Some other questions/concerns:
|
|
BasicWorkPublic:
enum State
{
WORK_RUNNING,
WORK_SUCCESS,
WORK_WAITING,
WORK_FAILURE_RETRY,
WORK_FAILURE_RAISE,
WORK_FAILURE_FATAL
};
BasicWork(Application& app, std::string uniqueName, size_t maxRetries, std::function<void(State state)> notifyCallback);
virtual ~BasicWork();
State getState();
// to avoid confusion with VirtualClock::crank, let’s call it `crankWork`
void crankWork();
Protected:
// Note that these are empty by default, I think
// implementers should not be forced to implement them
virtual void onReset() {};
virtual void onFailureRetry() {};
virtual void onFailureRaise() {};
virtual void onSuccess() {};
virtual State onRun() = 0;
// Reset is exposed at this level, as `Work` should also be able to reset when adding children
void reset()
{
// .. Work might reset some internal variables here.. //
onReset(); // whatever else implementers need to do (e.g. clean up children)
}
void wakeUp()
{
assert(mState == WAITING);
setState(RUNNING);
mNotifyCallback(mState);
}
// This is a callback that implementers provide that will do whatever needed
// internally then notify its parent
std::function<void(State stateOfChild)> mNotifyCallback;
Private:
State mState{RUNNING};
// Note that there are only 2 places where setState is called: crankWork and wakeUp
void setState(State newState);
// Retries helper functions would also go here; they are very similar to what we have now,
// except I think we could incorporate WAITING state for the retry delay,
// and wakeUp when it's time to retry
// Additionally, retry mechanism should be internal to BasicWork
void crankWork()
{
// We shouldn’t crank on work that is finished or waiting
assert(getState() != DONE && getState() != WAITING);
// For now, this is only for running state, we can add if condition for aborting
auto state = onRun();
setState(state);
if (state == FAILURE)
{
switch(state)
{
Case RETRY:
onFailureRetry();
Case RAISE:
Case FATAL:
onFailureRaise();
}
reset(); // note work is reset on any failure
}
else if (state == SUCCESS)
{
onSuccess();
}
// completed a unit of work
mNotifyCallback(state);
} What everybody can do: start running work and check on its state Work : public BasicWorkPublic:
Work(Application& app, std::string uniqueName, size_t retries, std::function<void(State state)> callback) : public BasicWork(app, uniqueName, retries, callback), ...
virtual ~Work( ...ensure children are properly cleaned up... );
template <typename T, typename... Args>
std::shared_ptr<T>
addWork(Args&&... args)
{
Auto child = make_shared<Work>(...);
addChild(child);
mNotifyCallback();
}
// I imagine we could also have a `addBatchOfWork` (with a better name),
// so that we notify parent once.
// ..Various methods to check the state of children..
Protected:
State onRun() override final
{
auto child = cleanUpAdnYieldNextChild();
if (child)
{
child->crankWork();
return RUNNING;
}
else
{
return doWork();
}
}
// Implementers decide what they want to do: spawn more children,
// wait for all children to finish, or perform work
virtual State doWork() = 0;
void onReset() override
{
clearChildren();
}
Private:
// remove completed children, round-robin next running child (skip `WAITING` children)
std::shared_ptr<Work> cleanUpAdnYieldNextChild();
std::vector<std::shared_ptr<Work>> mChildren;
// Update `mChildren`, reset child work
void addChild(std::shared_ptr<Work> child); What everybody can do: add children, check overall children progress WorkScheduler : public Work
{
public:
WorkScheduler(Application& app);
virtual ~WorkScheduler();
protected:
// When there are no children, all they all are in WAITING state,
// go into waiting state as well until event notification or new children
State doWork() override
{
Return WAITING;
}
} A callback for the scheduler would be slightly different, something like: auto maybeScheduleCrank = [self](State state)
{
// children could all be in WAITING state, avoid busy loop
if (self->anyChildrenRunning())
{
if (self->getState() == WAITING)
{
self->wakeUp();
}
else
{
IO_service.post(self->crankWork());
}
}
} Additionally, I think we might need a static create method for WorkScheduler. We first need to instantiate scheduler class (I guess with Examples
State doWork() override
{
// customized behavior for this work
auto finishStatus = applyBuckets();
if (stillApplyingBuckets())
{
return RUNNING;
}
else
{
return finishStatus == success ? SUCCESS : FAILURE;
}
}
State doWork()
{
if (!mGetRemoteFile)
{
mGetRemoteFile = addWork<GetRemoteFileWork>...;
Return RUNNING;
}
else
{
if (mGetRemoteFile->getState() == SUCCESS)
{
loadFile();
return SUCCESS;
}
else
{
// decide how to proceed if child failed; in this particular case, fail too
return FAILURE;
}
}
}
// A callback defined for children could be like this (passed to `GetRemoteFileWork` in this case)
std::weak_ptr<Work> self = ...get weak ptr…;
auto parentNotify = [self](State state)
{
// Do whatever derived class wants to do internally
// to react to `state`
// get shared ptr for self
self->mNotifyCallback(state); // propagate up
} |
Overall, looks reasonable now. Some questions and objections:
I think it would be helpful if you added more comments to the interface which define in what situations any given function can/should be called. |
Several modifications based on the feedback (pseudocode, might need to be simplified):
{
// children could all be in WAITING state, avoid busy loop
if (self->anyChildrenRunning())
{
// This could happen if `addWork` is called while scheduler is waiting
if (self->getState() == WAITING)
{
self->wakeUp();
}
else
{
IO_service.post(
{
self->crankWork();
self->mNotifyCallback();
});
}
}
}
template <typename T, typename... Args>
std::shared_ptr<T>
addWork(Args&&... args)
{
Auto child = make_shared<Work>(...);
addChild(child);
wakeUp(); // ensure to set state to RUNNING
}
// Assuming `mScheduleSelf` is initialized to “false”
State onRun() override final
{
If (mScheduleSelf || !hasRunningChildren())
{
mScheduleSelf = false;
return doWork();
}
else
{
mScheduleSelf = true;
auto child = cleanUpAndYieldNextChild();
child->crankWork();
return RUNNING;
} I'm working on adding more comments as well as an initial implementation. |
Additional clarifications, and some learnings from the initial implementation of work interface:
Actual callback: mNotifyCallback = [weak]() {
auto self = /* ...get shared_ptr for weak... */
if (self->getState() == BasicWork::WORK_RUNNING)
{
if (self->mScheduled)
{
return;
}
self->mScheduled = true;
self->mApp.getClock().getIOService().post([weak]() {
auto innerSelf = /* ...get shared_ptr for weak... */
innerSelf->mScheduled = false;
innerSelf->crankWork();
innerSelf->mNotifyCallback();
});
}
else if (self->getState() == BasicWork::WORK_WAITING &&
self->anyChildRunning())
{
self->wakeUp();
}
}; Since the scheduler depends on the children, State doWork() override
{
if (anyChildRunning())
{
return BasicWork::WORK_RUNNING;
}
return BasicWork::WORK_WAITING;
} The above aligns with expected action in
|
Also, few more thoughts to see how Like I already mentioned before, there are really two abort triggers: external and internal. An example of an external trigger is system shutdown, where an abort signal is issued to the work scheduler. An internal abort would be the following scenario: parent work A has some children. Consider one of the children goes into FAILURE_RAISE mode. Next time A is scheduled, it needs to take proper action to ensure the rest of the children are aborted. In order for this to happen, a few things to consider:
|
It's quite unclear for implementors what the various methods (onReset, onStart, etc) are really supposed to do, especially when combining Work (what
WorkParent
does for example).For example, I would expect implementations to follow:
onReset
initializes the Work to a pristine state (does NOT create work)onStart
should be the one creating the initial batch of work (also called on retry as retry triggers in sequence reset, advance, start)This is illustrated by
WorkParent
that doesn't manage its children at all: it's up to all implementors to remember to callclearChildren()
on reset for example (or face undefined behavior on retry).The text was updated successfully, but these errors were encountered: