
Add handling logic on CeleryExecutor to reschedule task stuck in queued status #28943

Conversation

okayhooni
Contributor

@okayhooni commented Jan 14, 2023

related: #ISSUE


We have been using Airflow with the Celery executor on EKS to schedule up to 8,000 task instances per day on a production data pipeline, and we have suffered from task instances getting stuck in the queued status.

On average, we found 1~3 task instances stuck in the queued state per day, and those tasks did not respect execution_timeout or write any task logs, because they had never been launched by any Celery worker!

I found logic for handling similar issues in the KubernetesExecutor's method clear_not_launched_queued_tasks().

So I made a custom DAG and task to perform a similar operation: find tasks stuck in the queued status and re-schedule them by updating the state field to scheduled with a SQLAlchemy query (like {TaskInstance.state: State.SCHEDULED}).
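
Under the hood this boils down to a single state update against the metadata DB. A minimal sketch of that core query (the helper name and the 30-minute threshold here are illustrative assumptions, not the exact code we run):

from datetime import datetime, timedelta

import pytz
from airflow.models.taskinstance import TaskInstance
from airflow.utils.session import NEW_SESSION, provide_session
from airflow.utils.state import State
from sqlalchemy.orm import Session


@provide_session
def reschedule_stuck_queued_tis(
    threshold: timedelta = timedelta(minutes=30),  # illustrative threshold
    *,
    session: Session = NEW_SESSION,
) -> int:
    """Flip task instances stuck in QUEUED back to SCHEDULED; returns the number of rows updated."""
    # queued_dttm is stored in UTC in the metadata DB
    cutoff = datetime.now(pytz.utc) - threshold
    return (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.QUEUED, TaskInstance.queued_dttm <= cutoff)
        .update({TaskInstance.state: State.SCHEDULED}, synchronize_session="fetch")
    )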

This DAG has worked very well for the last few months, but last Friday we ran into the rather ironic situation of a task instance of this custom DAG itself getting stuck in the queued state. So I think this checking logic would be better placed on the executor side, using Airflow's built-in sched-based scheduler (EventScheduler), like the KubernetesExecutor does.

If we knew the root cause of this problem, there would be a more elegant solution. But I read some of the related issues linked above, and I suspect there is not just one reason for stuck tasks; it varies case by case. (Some parts of this issue are marked as fixed, but the problem persists up to the latest release, v2.5.)

@boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label on Jan 14, 2023
Comment on lines +267 to +273
if self.task_stuck_in_queued_check_interval:
self.event_scheduler = EventScheduler()

self.event_scheduler.call_regular_interval(
self.task_stuck_in_queued_check_interval,
self.clear_not_launched_queued_tasks,
)
Member


I don't think we need an extra scheduler for this; we can probably check during a scheduler loop iteration, probably during _critical_section_enqueue_task_instances, to avoid race conditions (one scheduler trying to schedule a task that has been cleared by another).

@tanelk
Contributor

tanelk commented Feb 21, 2023

Doesn't the stalled_task_timeout configuration in the [celery] block do exactly this? It defaults to 0, meaning it is disabled by default.
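
For reference, it sits in the [celery] section of airflow.cfg (or the matching AIRFLOW__CELERY__STALLED_TASK_TIMEOUT environment variable); the 300-second value below is only an illustration:

[celery]
# Seconds a queued task may wait before it is considered stalled and automatically
# rescheduled; 0 (the default) disables the check.
stalled_task_timeout = 300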

@collinmcnulty
Contributor

@tanelk It is possible for tasks to be stuck in queued for reasons other than those checked in stalled_task_timeout. This seems like it would be more general purpose. See #28120

@RNHTTR
Contributor

RNHTTR commented Mar 2, 2023

  1. Should this set the task state to FAILED instead of SCHEDULED, to avoid a possible infinite loop of scheduled -> queued -> scheduled -> queued?
  2. Should there instead be something much simpler than this and/or stalled_task_timeout?
    1. On each scheduler loop, query for queued tasks that have been queued longer than {some config setting}, and fail those tasks.

@Tonkonozhenko
Contributor

@okayhooni could you share your DAG that restarts tasks? Unfortunately, we face the same issue.

@okayhooni
Contributor Author

@Tonkonozhenko
Sorry for the late response. I implemented a simple killer operator like the one below. (We currently use OpsGenie for alerting.)

from datetime import datetime, timedelta
from textwrap import dedent
from typing import Any, Dict, List, Optional

import pytz
from airflow.models import BaseOperator
from airflow.models.taskinstance import TaskInstance
from airflow.providers.opsgenie.hooks.opsgenie import OpsgenieAlertHook
from airflow.utils.session import NEW_SESSION, provide_session
from airflow.utils.state import State
from sqlalchemy.orm import Session

DEFAULT_ZOMBIE_THRESHOLD_TIMEDELTA = timedelta(minutes=30)
DEFAULT_ZOMBIE_MARKING_STATE = State.FAILED
DEFAULT_ZOMBIE_OPSGENIE_ALERT_PRIORITY = "P3"


class ZombieTaskKillerOperator(BaseOperator):
    """
    Mark zombie tasks stuck in the QUEUED state with a specified state (defaults to FAILED).
    related issues: https://github.com/apache/airflow/issues?q=is%3Aissue+is%3Aopen+stuck+queued

    :param zombie_threshold_timedelta: threshold timedelta for deciding whether a queued task is a zombie
    :param zombie_marking_state: state to set on detected queued zombie tasks, defaults to State.FAILED (= "failed")
    :param only_on_current_dag: whether to kill tasks only in the current DAG, defaults to True
    :param opsgenie_conn_id: Airflow connection ID for alerting with OpsGenie; if not provided, no alert is sent
    :param opsgenie_priority: priority of the OpsGenie alert; possible values are P1, P2, P3, P4, P5, defaults to P3
    """

    def __init__(
        self,
        *args,
        zombie_threshold_timedelta: timedelta = DEFAULT_ZOMBIE_THRESHOLD_TIMEDELTA,
        zombie_marking_state: State = DEFAULT_ZOMBIE_MARKING_STATE,
        only_on_current_dag: bool = True,
        opsgenie_conn_id: Optional[str] = None,
        opsgenie_priority: str = DEFAULT_ZOMBIE_OPSGENIE_ALERT_PRIORITY,
        **kwargs,
    ):
        super().__init__(*args, **kwargs)
        self.zombie_threshold_timedelta = zombie_threshold_timedelta
        self.zombie_marking_state = zombie_marking_state
        self.only_on_current_dag = only_on_current_dag
        self.opsgenie_conn_id = opsgenie_conn_id
        self.opsgenie_priority = opsgenie_priority

    def execute(self, context: Dict[str, Any]) -> None:
        """execute method for ZombieTaskKillerOperator"""
        queued_zombies = self._find_and_kill_queued_zombies()

        if queued_zombies and self.opsgenie_conn_id:
            self._alert_queued_zombies(zombie_tasks=queued_zombies, log_url=context["task_instance"].log_url)

    @provide_session
    def _find_and_kill_queued_zombies(
        self, *, session: Session = NEW_SESSION
    ) -> Optional[List[TaskInstance]]:
        """
        Find and kill queued zombies by updating their state in the Airflow metadata database.

        :param session: SQLAlchemy ORM session (injected by provide_session by default)
        :return: list of zombie task instance rows returned by the SQLAlchemy query, if any exist
        """
        cur_utc_datetime = datetime.now(pytz.utc)
        self.log.info(f"[cur_utc_datetime]: {cur_utc_datetime}")

        queued_zombie_target_filters = [
            TaskInstance.state == State.QUEUED,
            TaskInstance.queued_dttm <= cur_utc_datetime - self.zombie_threshold_timedelta,
        ]  # queued_dttm field is loaded on metadb with UTC timezone

        if self.only_on_current_dag:
            queued_zombie_target_filters.append(TaskInstance.dag_id == self.dag_id)

        queued_zombies_query = session.query(
            TaskInstance.dag_id,
            TaskInstance.run_id,
            TaskInstance.task_id,
            TaskInstance.operator,
            TaskInstance.state,
            TaskInstance.queued_dttm,
            TaskInstance.queue,
            TaskInstance.pool,
        ).filter(*queued_zombie_target_filters)

        queued_zombies = queued_zombies_query.all()
        if not queued_zombies:
            self.log.info("CLEAR! CURRENTLY, THERE IS NO QUEUED ZOMBIE FOUND")
            return

        self.log.warning(f"[Queued Zombies]: {queued_zombies}")
        queued_zombies_query.update(
            {TaskInstance.state: self.zombie_marking_state}, synchronize_session="fetch"
        )
        self.log.warning(f"MARKING QUEUED ZOMBIES TO `{self.zombie_marking_state}` STATE")

        return queued_zombies

    def _alert_queued_zombies(self, *, zombie_tasks: List[TaskInstance], log_url: str) -> None:
        """
        Alert on OpsGenie about the queued zombie tasks that were marked with the specified state.

        :param zombie_tasks: list of zombie task instance rows returned by the SQLAlchemy query
        :param log_url: URL of this zombie-killer task's log
        """
        zombie_tasks_description = [
            dedent(
                f"""\
            [CAUGHT ZOMBIE - {idx}]
            *DAG*: {zombie.dag_id}
            *Run ID*: {zombie.run_id}
            *Task*: {zombie.task_id}
            *Queued Time*: {zombie.queued_dttm}\
            """
            )
            for idx, zombie in enumerate(zombie_tasks, 1)
        ]

        alert_msg_json = {
            "message": f"[Airflow] Marking {len(zombie_tasks)} queued zombie task(s) as `{self.zombie_marking_state}`",
            "description": "\n=================================\n".join(zombie_tasks_description),
            "details": {"Logs": log_url},
            "source": "airflow",
            "priority": self.opsgenie_priority,
            "tags": ["airflow"],
        }

        opsgenie_hook = OpsgenieAlertHook(self.opsgenie_conn_id)
        opsgenie_hook.create_alert(alert_msg_json)

Then I deployed this custom operator in a DAG like the one below (15-minute interval):

    zombie_killer = ZombieTaskKillerOperator(
        task_id="zombie_killer",
        zombie_marking_state=State.SCHEDULED,
        zombie_threshold_timedelta=timedelta(minutes=29),
        only_on_current_dag=False,
        opsgenie_conn_id="ALARM_OPSGENIE_DATAPLATFORM",
        opsgenie_priority="P5",
    )
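
(For completeness, a minimal sketch of the surrounding DAG; the import path, DAG id, start date, and catchup/concurrency settings below are illustrative assumptions, only the 15-minute schedule comes from the description above.)

from datetime import datetime, timedelta

from airflow import DAG
from airflow.utils.state import State

from zombie_task_killer import ZombieTaskKillerOperator  # hypothetical module holding the operator defined above

with DAG(
    dag_id="zombie_task_killer",  # illustrative DAG id
    start_date=datetime(2023, 1, 1),  # illustrative start date
    schedule_interval=timedelta(minutes=15),
    catchup=False,
    max_active_runs=1,
) as dag:
    zombie_killer = ZombieTaskKillerOperator(
        task_id="zombie_killer",
        zombie_marking_state=State.SCHEDULED,
        zombie_threshold_timedelta=timedelta(minutes=29),
        only_on_current_dag=False,
        opsgenie_conn_id="ALARM_OPSGENIE_DATAPLATFORM",
        opsgenie_priority="P5",
    )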

It's not a fancy/graceful solution, but it works well.

@okayhooni
Contributor Author

@RNHTTR

  1. Should this set the task state to FAILED instead of SCHEDULED, to avoid a possible infinite loop of scheduled -> queued -> scheduled -> queued?

Previously, I did set the task state to FAILED instead of SCHEDULED, but we had to clear the failed stuck tasks all the time, so we decided to set the task state to SCHEDULED instead. That side effect has never happened, so far.

@Tonkonozhenko
Contributor

Thanks, @okayhooni.
Also, I set stalled_task_timeout and it helped; you should probably try it too. It fixes 1-2 stuck tasks per day.

@RNHTTR
Contributor

RNHTTR commented Mar 8, 2023

@Tonkonozhenko This PR is particularly useful for rare cases where stalled_task_timeout doesn't resolve the problem!

@potiuk force-pushed the feature/reschedule_tasks_stuck_in_queued_state branch from b9f6784 to 835da8a on March 10, 2023 12:16
@potiuk
Member

potiuk commented Mar 10, 2023

This one seems able to handle a problem that our users hit really often. I don't know that much about the intricacies of our Celery integration, but this seems to add enough of a harness and takes a similar approach to the one we already use in the K8s executor. So maybe those who have more Celery experience can take a look at it?

WDYT @ashb @ephraimbuddy @uranusjr, and maybe also @dstandish for the similar case in K8s? I think this one might solve some real pains for users (though it might not address the root cause of the problems observed).

@potiuk
Member

potiuk commented Mar 15, 2023

There are some tests to fix, though.

@RNHTTR
Contributor

RNHTTR commented Mar 16, 2023

@potiuk IMO we should replace stalled_task_timeout altogether (#30108)

@potiuk
Member

potiuk commented Mar 16, 2023

@potiuk IMO we should replace stalled_task_timeout altogether (#30108)

Hard to say. I'd prefer those who have a deeper understanding of how the Celery integration works (@ashb @ephraimbuddy @uranusjr) to chime in.

@RNHTTR
Contributor

RNHTTR commented Mar 16, 2023

@potiuk IMO we should replace stalled_task_timeout altogether (#30108)

Hard to say. I'd prefer those who have a deeper understanding of how the Celery integration works (@ashb @ephraimbuddy @uranusjr) to chime in.

If #30108 is indeed the right path forward for replacing stalled_task_timeout, I suspect it makes sense to also remove task_adoption_timeout.

Labels
area:Scheduler including HA (high availability) scheduler
7 participants