Remove cancelled, resumed, and long-running states #6844
Conversation
```python
if ts.state == "flight":
    ts.done = True
    tasks.add(ts)
elif ts.state == "released":
```
How can a task be released at this point? The point of having a `cancelled` state in the past was to avoid this, since it can lead to a couple of inconsistencies. I'm concerned that we're opening ourselves to the same problems again.
A task that is currently being served by GatherDep or Execute can have any state except `forgotten`.

The key code is in:

- `transition_executing_released` (cancel executing)
- `transition_flight_released` (cancel flight)
- `transition_waiting_ready` (resume executing)
- `transition_generic_fetch` (resume flight)
- `transition_released_forgotten` (do not forget if currently in GatherDep or Execute)
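As a hedged illustration (not the actual `distributed` source; simplified to the attributes discussed in this thread), the two cancellation transitions boil down to parking the task in `cancelled` while remembering where it came from:

```python
# Illustrative sketch only: how cancellation on main remembers the origin
# of a task whose coroutine cannot be aborted. Attribute names follow the
# discussion (TaskState._previous); this is not the real implementation.

def transition_executing_released(ts):
    # The Execute coroutine keeps running, so instead of going straight
    # to "released" the task parks in "cancelled".
    ts._previous = "executing"
    ts.state = "cancelled"

def transition_flight_released(ts):
    # Same idea for a task whose GatherDep is still in flight.
    ts._previous = "flight"
    ts.state = "cancelled"
```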
IMO, the complexity reduction in this PR is a bit misleading. You are decreasing the number of TaskStateStates and reducing LOC, which at first glance looks like an objective reduction of complexity. However, the devil is in the details.

The reduction of TaskStateStates causes the remaining states to gain a higher degree of degeneracy. For instance, on main, the TSS `flight` clearly encodes a task that is currently in the process of being gathered from a remote peer. Further, the scheduler's intention and the future flow of this task are clearly determined based on the result of the remote fetch. There is little to no ambiguity in any follow-up decision, e.g. "gather dep failed -> always do X".

On main, a change of intended flow is encoded by transitioning the task to a different TSS, e.g. `cancelled`. For this argument, let's stick to a "happy path" / "forward progress path", e.g. `waiting->executing->cancelled-{ExecuteDoneEvent}->?`.

Given the state is `cancelled`, every `ExecuteDoneEvent` has a very well defined outcome. The control flow is unambiguous, which leads to relatively easy code with few control branches, i.e. it favors low local code complexity at the cost of higher system complexity.
Now, looking at a non-trivial control flow, e.g. `waiting->executing->cancelled-{AcquireReplicasEvent|ComputeTaskEvent}->?`, this is no longer well defined, since the state `cancelled` in reality describes two different situations, i.e. `cancelled_from_executing` and `cancelled_from_flight`, and depending on which substate we're dealing with, the outcome would be very different. This substate is encoded in the `TaskState._previous` attribute.
This situation is very similar to the `resumed` state which, strictly speaking, should break up into `resume_to_fetch` and `resume_to_waiting` substates, depending on the `ts._next` attribute. The implementation around these states may be buggy, but conceptually I'm not convinced that we should remove them. The reason why I never broke `cancelled`/`resumed` up further was mostly because I thought this to be a feasible compromise to reduce the number of required transition functions. That caused the transition functions we have to be overly complex and buggy.
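To make this concrete, here is a hedged sketch of the substate dispatch being described (simplified pseudo-handler, not the real `distributed` code; `_previous` and `_next` are the attributes named above):

```python
# Simplified illustration of the degeneracy described above: one nominal
# state hides several real situations that must be told apart. The branch
# bodies are deliberately elided.

def handle_compute_task_event(ts):
    if ts.state == "cancelled":
        if ts._previous == "executing":
            ...  # "cancelled_from_executing": an Execute is still running
        elif ts._previous == "flight":
            ...  # "cancelled_from_flight": a GatherDep is still running
    elif ts.state == "resumed":
        if ts._next == "waiting":
            ...  # "resume_to_waiting" substate
        elif ts._next == "fetch":
            ...  # "resume_to_fetch" substate
```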
To my best understanding, this PR does not fundamentally change the control flow of a task but merely chooses a different way to describe it. After all, we still can't cancel a `gather_dep` or an `execute` (the latter could be cancelled, but that would not allow us to fundamentally simplify the problem, so I'll ignore any "abort thread" proposals for the sake of the argument).

This PR now proposes to remove the `cancelled` state, which will ultimately require us to encode the information I described above in a different way. Specifically, this PR proposes to transition a task directly back to `released` and to remember that the task is still executing by putting it back into the `executing` dict, i.e. this PR increases the degeneracy of both the `released` state and the `executing` dict.
In other words, on main, the semantic meaning of the `executing` dict was simple and unique: it included "all tasks in state `executing`". With this PR, it would encode "all tasks in state `executing` and some tasks in state `released`". Similarly, `released` would no longer be a neutral state but a state that means "neutral, or still executing, or still fetching".

These small and subtle changes of semantics require us to always check all the conditions to infer what the actual state is. Every time we interact with a released task, we'll need to check whether it is in a neutral state, whether it is still executing, etc.
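A hypothetical helper makes the point: under this PR, `released` alone no longer tells you what is actually going on (the set names exist in `WorkerState`; the helper itself is made up for illustration):

```python
# Hypothetical helper illustrating the extra checks described above.

def describe_released_task(state, ts):
    assert ts.state == "released"
    if ts in state.executing:
        return "released, but an Execute coroutine is still attached"
    if ts in state.long_running:
        return "released, but a seceded Execute is still attached"
    if ts in state.in_flight_tasks:
        return "released, but a GatherDep is still fetching it"
    return "released and neutral"
```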
This PR exhibits this increase of complexity in various places. Nice examples are `_execute_done_common` and `_gather_dep_done_common`, which previously were basically no-ops (the former doesn't even exist on main); now we've got switch statements. This is how the `finally` clause of `gather_dep` started and evolved, and we invested an awful amount of time to get rid of that.

I would even go as far as to claim that there are bugs because we're not dealing with this degeneracy properly. For instance, let's assume a task was in flight, got cancelled, and was then asked to be computed again. IIUC, we're nowhere dealing with the fact that the task is still in flight but are transitioning it straight to executing, i.e. we could have a task simultaneously in flight and in executing. That's an entire class of inconsistency problems that originally caused the first "wave" of deadlocks.
At first glance, this PR reverts a lot of hard effort in making the state machine more explicit. The verbosity of the current code was intentional to a certain degree. The need for this explicitness was already motivated back in #4413 where this entire refactoring started.
Re: long-running

I don't see a need to change anything about `long-running`. In fact, I consider merging this with `executing` quite misleading. Long-running tasks are not handled well right now, and this change might address some of these artifacts, but they are not working well regardless, since thread rejoining is not implemented.

I believe it makes sense to distinguish these two states from an instrumentation POV alone.
I think this is a wild exaggeration. It causes exactly three transitions to gain a small code branch:
This has not changed.
Just have a look at the code: the control flow in main is, to say the least, chaotic. Also in main, the worker state needs to deal with a GatherDep or Execute finishing while the state is just about anything. This has caused many issues in the past.
This is the theory, and it's well and good. Except that the implementation doesn't do that, and there are points where `ts._next` can be missing, or other weird stuff. I can find them on request. The reason for this general bugginess is that it is just so ridiculously hard to wrap one's head around the cancelled/resumed state. I myself, after spending many weeks refactoring the state machine, did not have a solid grasp on it, and only now can I say that I fully understand it.
Not true. More philosophically: we have a (somewhat) inescapable problem, which is that Execute and GatherDep can't just be cancelled. In main, the way to cope with it is to enter four special states, cancelled(flight), cancelled(executing), resumed(flight->waiting), resumed(executing->fetch), plus buggy intruders like resumed(executing->missing), just to deal with it, and a wealth of very, very special transitions for when each of these four states finishes. In this PR, we simply say that the Execute and GatherDep asyncio tasks can just stay there, unattended, until we need to do one of three things:
Agreed, such proposals are interesting but complicated to implement and definitely out of scope.
This is false.
While the `long_running` set includes
and the `in_flight_tasks` set includes
It is also interesting to note that, given the above, the information encoded in the `previous` and `next` attributes is redundant and could be fully derived from the task's inclusion in one of the three sets.
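For illustration, a hedged sketch of that redundancy (hypothetical function; the three set names are the ones discussed in this thread):

```python
# What used to live in ts._previous can be recovered from set
# membership alone. Hypothetical sketch, not actual code.

def infer_previous(state, ts):
    if ts in state.executing or ts in state.long_running:
        return "executing"  # an Execute coroutine is still attached
    if ts in state.in_flight_tasks:
        return "flight"     # a GatherDep coroutine is still attached
    return None             # neutral: no coroutine attached
```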
False. There are very, very few places where this happens, because they're the places which would directly interact with the currently running asyncio tasks:
You're talking about a very different time, where the event handlers were spaghettified together with the actual code running the instructions. I personally do not see any issue, today, in moving business logic away from the
Yes, this is the main feature of this PR.
Could you come up with what these inconsistency problems are? Again, you are talking about a time where

Could you come up with a list of examples of how a PR from a junior contributor could subtly cause this to become a problem, without any of the current tests tripping very explicitly about it? I can't come up with any.
I disagree. It simply states that an abandoned GatherDep or Execute instruction should be a no-op as much as possible.
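A minimal sketch of such a no-op guard, assuming the set names from this PR (the handler itself is hypothetical, not the actual implementation):

```python
# Hypothetical end-of-Execute handler illustrating the "abandoned
# instruction is a no-op" principle.

def on_execute_done(state, ts, result):
    # Always detach the finished coroutine from the bookkeeping sets.
    state.executing.discard(ts)
    state.long_running.discard(ts)
    if ts.state != "executing":
        # The task was cancelled or rerouted while we were computing:
        # drop the stale result and do nothing else.
        return
    ...  # normal completion handling (transition to memory or error)
```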
This PR has no intention to reduce verbosity. However, I already spent many, many weeks dealing with issues that were specifically hidden in the transitions from cancelled and from resumed, particularly in the intersections with other edge cases, e.g. #6685.
I personally lost count of how many PRs I already wrote trying to fix the long-running state, all of which were caused by a change at some point in
Yes, and the reason is the one above.
This change fixes all the problems of having a double state executing/long-running which must behave in the same way. With this PR, the one and only place where long-running is treated differently from executing is a single line in
This is a completely separate issue and it should be treated as out of scope.
...ok, I found an issue that may scupper the whole design. In this events stream:
the task may not find

However, this is a big problem:

then you are expected to resume the task - but `z` may not be there anymore, and `run_spec` most likely changed too. I already encountered the same problem with `resource_restrictions`, and I could use the same logic for `run_spec`. `dependencies` is a lot more problematic. I'm unsure if there's a clean way to deal with this which does not make `Worker.execute` less dumb than it is now (which is a huge feature) and is robust against subtle race conditions which are very hard to reproduce - a bunch of
I fixed the race condition I described in my previous post. However, while I was dealing with this I encountered, in main, three other very subtle issues in the very same
As discussed during standup, I will now pause this PR, work on the issues above, write the necessary complicated tests for them, and come back here.
#6869 is an example of why I am concerned about letting multiple coroutines for the same task run, i.e. `gather_dep` and `execute`. This can cause overlap, and our event handlers need to be rock solid to make sure arbitrary overlaps are handled properly, which I am currently not sufficiently confident about. Again, I'm advocating for dropping this PR in favor of #6699 for the time being.
We've got another issue automatically fixed by this PR: #6877.
I've removed the
I've found a roadblock that I didn't consider before. We could mitigate this through the weakref cache in

I give up. I'll revert to #6699.
Sure, we can have a conversation about this. Here's the gist of it up front (no need to reply if we find some time to briefly talk about it).

TLDR: I don't mind reusing the implementation and am open to making the transition methods deal with both long-running and executing at the same time to avoid code duplication. We're already dealing with a similar situation with the ready/constrained states, and I believe we can handle it similarly.
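As a hedged sketch of that compromise (hypothetical method; it mirrors how a single code path can serve two sibling states, the way ready/constrained are already handled together):

```python
# Hypothetical shared transition serving both "executing" and
# "long-running", so the two flavours cannot drift apart. Set names
# follow the discussion; everything else is made up for illustration.

def transition_running_memory(state, ts):
    assert ts.state in ("executing", "long-running")
    # The seceded flavour differs only in its bookkeeping set.
    state.executing.discard(ts)
    state.long_running.discard(ts)
    ts.state = "memory"
```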
- `_transition_from_resumed` contains legacy code and documentation #6693
- `resumed->rescheduled` is an invalid transition #6685
- `WorkerState` #6708
- `cancelled->long-running` and `resumed->long-running` are invalid transitions #6709
- `AssertionError` in `WorkerState._transition_cancelled_error` #6877

This is an even more aggressive redesign than #6699 and #6716.
- Remove the `long-running` state. A seceded task is distinguished from an executing one exclusively by being in the `WorkerState.long_running` set instead of the `executing` set.
- Remove the `cancelled` state. Instead, a cancelled task simply transitions to other states while remaining in the `executing`, `long_running`, or `in_flight_tasks` sets. A task will not transition to `forgotten` for as long as it is in one of the three above sets.
- Remove the `resumed` state. Instead (see the sketch at the end of this description):
  - if a task would transition to `ready` but it is already in either the `executing` or the `long_running` set, it transitions back to `executing` instead;
  - if it would transition to `fetch` but it is already in the `in_flight_tasks` set, it transitions back to `flight` instead.
- All end events for Execute and GatherDep:
  - take effect only while the task is still in the `executing` or `flight` state respectively;
  - transition the task to `forgotten` if the task is in the `released` state.
- At any given moment, there may be both an Execute and a GatherDep instruction running for the same task. If the Execute instruction finishes while the task is in flight, it will be a no-op, and vice versa. This means we no longer have to worry about mismatched end events.
- Remove the `previous`, `next`, and `done` TaskState attributes.

TODO
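A hedged sketch of the resume rules in the bullets above (simplified, hypothetical function names; the three sets are the ones this PR keeps in `WorkerState`):

```python
# Illustration of the rerouting that replaces the resumed state.

def decide_compute_target(state, ts):
    # A task asked to compute while its old Execute is still attached
    # goes straight back to executing instead of ready.
    if ts in state.executing or ts in state.long_running:
        return "executing"
    return "ready"

def decide_fetch_target(state, ts):
    # A task asked to fetch while its old GatherDep is still attached
    # goes straight back to flight instead of fetch.
    if ts in state.in_flight_tasks:
        return "flight"
    return "fetch"
```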