Queuing inhibits releasing of tasks #7396
The impact is relatively mild. We are obviously recomputing some tasks again even though we shouldn't. Once the workers complete, the scheduler will tell them to release the task again. We will have "zombie" tasks in state released on the scheduler, but workers will be clean. When using restart, this situation can produce other follow-up failures since we already cleaned up some other state, e.g. TaskPrefixes, see coiled/benchmarks#521 (comment).
This makes sense, but I'm curious why. I assume this means that there's a previous batch of recommendations which recommends a task to …
I'm concerned this is a more deeply rooted issue caused by us just updating the recommendation dict here: distributed/distributed/scheduler.py, line 1959 in 7fb9c48.

Having key collisions here will inevitably cause problems. I guess we were lucky so far?
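For illustration, here is a minimal, self-contained sketch (not the scheduler's actual code) of why such a key collision loses information: recommendations are keyed by task, so a later recommendation silently replaces an earlier one for the same key.

```python
# Minimal illustrative sketch, not distributed's actual scheduler code:
# recommendations are keyed by task key, so a later entry replaces an
# earlier one for the same key.
recommendations: dict[str, str] = {}

# An earlier transition decided this (hypothetical) task should be released/forgotten.
recommendations["x-1"] = "released"

# While handling the same stimulus, another transition (e.g. a queued slot
# opening up) recommends the very same key again.
recommendations["x-1"] = "processing"  # silently overwrites "released"

# Only the last recommendation survives, so the release never happens.
assert recommendations == {"x-1": "processing"}
```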
Yeah, I was also thinking that. You could argue recommendations might make more sense as a stack. I think we may implicitly rely on this behavior in too many places, though.

Another thing is that maybe we could do something like this:

```diff
diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index dbaa7cfa..153d6c80 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -5291,12 +5291,19 @@ class Scheduler(SchedulerState, ServerNode):
         recommendations, client_msgs, worker_msgs = r
         self._transitions(recommendations, client_msgs, worker_msgs, stimulus_id)
 
+        recommendations = self.stimulus_task_slot_opened(stimulus_id=stimulus_id)
+        self._transitions(recommendations, client_msgs, worker_msgs, stimulus_id)
+
         self.send_all(client_msgs, worker_msgs)
 
     def handle_task_erred(self, key: str, stimulus_id: str, **msg) -> None:
         r: tuple = self.stimulus_task_erred(key=key, stimulus_id=stimulus_id, **msg)
         recommendations, client_msgs, worker_msgs = r
         self._transitions(recommendations, client_msgs, worker_msgs, stimulus_id)
+
+        recommendations = self.stimulus_task_slot_opened(stimulus_id=stimulus_id)
+        self._transitions(recommendations, client_msgs, worker_msgs, stimulus_id)
+
         self.send_all(client_msgs, worker_msgs)
 
     def release_worker_data(self, key: str, worker: str, stimulus_id: str) -> None:
```

Lastly, by just adding this assertion we at least fail validation:

```diff
diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index dbaa7cfa..7c6b763f 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -3088,6 +3088,7 @@ class SchedulerState:
         assert ts not in self.unrunnable
         assert ts not in self.queued
         assert all(dts.who_has for dts in ts.dependencies)
+        assert ts.who_wants or ts.waiters
 
     def _add_to_processing(self, ts: TaskState, ws: WorkerState) -> Msgs:
         """Set a task as processing on a worker and return the worker messages to send"""
```

So perhaps either:

Currently, I think I'm most in favor of the …
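Note that `stimulus_task_slot_opened` in the first diff is a method that would still have to be written; the sketch below is only one hypothetical shape it could take (names, attributes, and logic are illustrative assumptions, not the actual distributed API), decoupling "pick something from the queue" from the finished/erred transition that freed the slot:

```python
# Hypothetical sketch only: this method does not exist in distributed and the
# attribute access below is simplified. It illustrates generating
# "queued -> processing" recommendations as a separate stimulus, instead of
# inside _exit_processing_common.
def stimulus_task_slot_opened(self, stimulus_id: str) -> dict:
    """Recommend queued tasks for processing wherever threads are free."""
    free_slots = sum(
        max(0, ws.nthreads - len(ws.processing)) for ws in self.workers.values()
    )
    recommendations: dict = {}
    for ts in list(self.queued)[:free_slots]:
        # Only resurrect tasks that are still wanted; a task with neither
        # who_wants nor waiters is about to be released and should stay put.
        if ts.who_wants or ts.waiters:
            recommendations[ts.key] = "processing"
    return recommendations
```

Whether such a helper filters on `who_wants`/`waiters` itself, or instead relies on the `released` recommendation having been applied first, is exactly the ordering question raised above.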
The transition logic for queued tasks is unstable and ordering dependent when releasing tasks.

Specifically, _exit_processing_common pops tasks from the worker queues whenever something leaves the processing state, i.e. specifically also during the transition processing->released (distributed/distributed/scheduler.py, lines 3136 to 3139 in 7fb9c48). There, recommendations are generated for a key to be transitioned into processing. This new recommendation will then overwrite an earlier recommendation to release/forget this task (distributed/distributed/scheduler.py, line 1959 in 7fb9c48), such that the task is never forgotten but ends up in state processing instead.
Note: These tasks are technically in a corrupt state then because they do not have a TaskState.who_needs and they do not have any dependent with a who_needs, but we're not checking this in validate_state, which is why this issue didn't pop up, I believe (and I suggest not introducing such a check: walking the entire graph for every task is prohibitively expensive and would slow down our tests).

cc @gjoseph92
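For concreteness, a validate_state check that would actually catch these corrupt tasks would have to walk the dependents of every task, roughly like the hypothetical sketch below (illustrative only, not existing distributed code, using `who_wants`/`dependents`/`waiters` as in the assertion diff above), which is what makes it prohibitively expensive:

```python
def task_is_still_wanted(ts) -> bool:
    """Hypothetical check: is this task wanted by a client, either directly
    or via some transitive dependent? Walking all dependents like this for
    every task on every validation pass is what would be too expensive."""
    seen = set()
    stack = [ts]
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        if cur.who_wants:  # some client still wants this key
            return True
        stack.extend(cur.dependents)
    return False
```

The `assert ts.who_wants or ts.waiters` from the second diff is the cheap approximation of this: it only looks at direct waiters instead of walking the whole dependent graph.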