Simplify logic to resolve tasks stuck in queued despite stalled_task_timeout #30108
Conversation
For backward compatibility and semver reasons, would making the tasks go to failed be considered a breaking change? I know that behavior was different from the comparable k8s executor behavior and seems a little askew from expectations, so it could be considered a bug, I guess? And this does cover that case plus more, so that's good.
Would appreciate your feedback @okayhooni & @repl-chris!
airflow/jobs/scheduler_job.py
Outdated
@@ -853,6 +858,10 @@ def _run_scheduler_loop(self) -> None:
        # Check on start up, then every configured interval
        self.adopt_or_reset_orphaned_tasks()

        if self._task_queued_timeout:
            self._fail_tasks_stuck_in_queued()
            timers.call_regular_interval(self._task_queued_timeout, self._fail_tasks_stuck_in_queued)
Do you need both of these lines? Don't you only want the one wrapped in timers.call_regular_interval?
I was following the precedent set by adopt_or_reset_orphaned_tasks (line 858), which runs on scheduler startup and then at a regular interval.
This probably doesn't need to run at startup. Adoption makes sense to do because it's pretty likely that another scheduler just shut down if we have a new one starting, but I don't think we have a similar situation here.
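To make the suggestion concrete, here is a minimal standalone sketch of registering only the interval check, using the EventScheduler helper the scheduler loop builds its timers from (airflow.utils.event_scheduler in Airflow 2.x). The timeout value and the stub callback are placeholders, not the PR's real code:

```python
# Sketch only: register the stuck-in-queued check on an interval, with no
# immediate call at startup. The callback and the timeout value are placeholders.
from airflow.utils.event_scheduler import EventScheduler


def fail_tasks_stuck_in_queued() -> None:
    print("checking for task instances stuck in queued")


task_queued_timeout = 600.0  # hypothetical value of scheduler.task_queued_timeout

timers = EventScheduler()
timers.call_regular_interval(task_queued_timeout, fail_tasks_stuck_in_queued)

# The scheduler's main loop then drives the registered timers with
# non-blocking runs, roughly: timers.run(blocking=False) on each iteration.
```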
@@ -1037,6 +1037,7 @@ task_track_started = True
# :ref:`task duration<ui:task-duration>` from the task's :ref:`landing time<ui:landing-times>`.
task_adoption_timeout = 600

# Deprecated. Use scheduler.task_queued_timeout instead.
Is this deprecation reflected in the code base by a DeprecationWarning?
Updated to include DeprecationWarning
We should remove it if it's no longer needed.
Would removing it unnecessarily break backward compatibility?
No
airflow/jobs/scheduler_job.py
Outdated
@@ -1408,6 +1416,40 @@ def _send_sla_callbacks_to_processor(self, dag: DAG) -> None:
        )
        self.executor.send_callback(request)

    @provide_session
    def _fail_tasks_stuck_in_queued(self, session: Session = NEW_SESSION) -> None:
This would affect all executors, not just Celery, and we have some other settings in Kubernetes for pending tasks, etc. WDYT?
IMO this is a general problem that applies to both kubernetes & celery executors. The relevant k8s exec setting is worker_pods_queued_check_interval -- I think this can also be handled in the scheduler. I also think this can probably replace task-adoption-timeout
If you agree, I'll remove those configurations as well.
worker_pods_queued_check_interval is similar, but different in that it won't automatically just reset the TI. It first checks to see if the pod exists.
worker_pods_pending_timeout is essentially this same process though. It should probably be deprecated as well (though, not sure how config handles many -> one).
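For readers skimming the thread, here is a rough, self-contained sketch of what a scheduler-side check along these lines could look like. It is an illustration of the idea, not the PR's actual implementation; the function name, the query, and the error message are assumptions.

```python
# Illustrative sketch, not the PR's code: fail task instances that have been
# sitting in QUEUED longer than a configured timeout, routing them through
# handle_failure() so retry semantics are preserved.
from datetime import timedelta

from sqlalchemy.orm import Session

from airflow.models.taskinstance import TaskInstance
from airflow.utils import timezone
from airflow.utils.session import NEW_SESSION, provide_session
from airflow.utils.state import TaskInstanceState


@provide_session
def fail_tasks_stuck_in_queued(task_queued_timeout: float, session: Session = NEW_SESSION) -> None:
    cutoff = timezone.utcnow() - timedelta(seconds=task_queued_timeout)
    stuck_tis = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.state == TaskInstanceState.QUEUED,
            TaskInstance.queued_dttm < cutoff,
        )
        .all()
    )
    for ti in stuck_tis:
        # handle_failure() follows the normal failure path, so tasks with
        # retries remaining go to UP_FOR_RETRY instead of FAILED.
        ti.handle_failure(error="Task stuck in queued state for too long", session=session)
```

Whether something like this should also subsume the executor-specific knobs (task_adoption_timeout, worker_pods_pending_timeout) is exactly the many-to-one deprecation question raised above.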
airflow/jobs/scheduler_job.py
Outdated
        self._task_queued_timeout = conf.getfloat(
            "scheduler",
            "task_queued_timeout",
            fallback=stalled_task_timeout,
You don't have to do this manually; Airflow does it for you when you mark it as a deprecated option.
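For context on what "Airflow does it for you" means: AirflowConfigParser in airflow/configuration.py keeps a deprecated_options mapping, and once the old key is registered there, reading the new key falls back to the old one and emits a DeprecationWarning automatically. A hedged sketch of such an entry (the exact tuple layout, old section, and version string are assumptions drawn from memory of the 2.x code base):

```python
# Illustrative only: an entry mapping the new option to its deprecated
# predecessor, in the (new_section, new_key) -> (old_section, old_key, version)
# shape used by AirflowConfigParser.deprecated_options in Airflow 2.x.
deprecated_options = {
    ("scheduler", "task_queued_timeout"): ("celery", "stalled_task_timeout", "2.6.0"),
}
```

How this mechanism handles several old keys collapsing into one new key is the open question mentioned earlier for worker_pods_pending_timeout and task_adoption_timeout.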
airflow/jobs/scheduler_job.py
Outdated
        queued for longer than `self._task_queued_timeout` as failed. If the task has
        available retries, it will be retried.
I don't see any handling of retry state in this code; we go straight to failed and bypass retry logic (unless I've forgotten how that works?).
Would it be sufficient to call TI.handle_failure()?
It seems like it should be, based on my read of TI.handle_failure.
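For reference, the retry decision inside TaskInstance.handle_failure boils down to an eligibility check. The helper below is a paraphrased sketch of that choice (the real method also runs failure callbacks, email alerting, and logging), included to show why routing stuck tasks through handle_failure() keeps retry behavior intact:

```python
from airflow.models.taskinstance import TaskInstance
from airflow.utils.state import TaskInstanceState


def resolve_failure_state(ti: TaskInstance, force_fail: bool = False) -> TaskInstanceState:
    # Paraphrased: handle_failure() sends a task instance to UP_FOR_RETRY when
    # it is still eligible to retry, otherwise to FAILED.
    if not force_fail and ti.is_eligible_to_retry():
        return TaskInstanceState.UP_FOR_RETRY
    return TaskInstanceState.FAILED
```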
I'm unclear on why task adoption was removed. It covers a whole class of problem (running tasks whose scheduler died) that doesn't seem to be otherwise addressed by this PR.
Accidentally closed this and nuked my changes. I'll open a new PR or re-open this one.
closes: #28120
closes: #21225
closes: #28943
Tasks occasionally get stuck in queued and aren't resolved by stalled_task_timeout (#28120). This PR moves the logic for handling stalled tasks to the scheduler and simplifies it by marking any task that has been queued for more than scheduler.task_queued_timeout as failed, allowing it to be retried if the task has available retries. This doesn't require an additional scheduler, nor does it allow for the possibility of tasks getting stuck in an infinite loop of scheduled -> queued -> scheduled -> ... -> queued, as in #28943.
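A quick note for anyone wanting to try the new knob once it lands: Airflow config keys can be overridden via AIRFLOW__{SECTION}__{KEY} environment variables, so something like the following (the 600-second value is just an example) should work without editing airflow.cfg:

```python
# Example only: override scheduler.task_queued_timeout via the standard
# AIRFLOW__{SECTION}__{KEY} environment-variable convention, then read it back.
import os

os.environ["AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT"] = "600"

from airflow.configuration import conf

print(conf.getfloat("scheduler", "task_queued_timeout"))  # -> 600.0
```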