Replies: 4 comments 2 replies
-
My original hope was that we could recreate this if the worker container got OOM-killed while it was parsing the DAG. My impression was that at this point in time, CeleryExecutor would have said "I got this one" to Airflow, but would not yet have fully committed to the work, so that other instances wouldn't adopt the task if the worker was then killed. That didn't quite happen (here's a gist with more detail: https://gist.github.com/MatrixManAtYrService/cab3fddf52fd7188599914f4a2257706 ). The worker logs had this error (the WARNING text is from code I used to trigger the OOM kill):
And the task_instance table looked like this:
And the scheduler logs had this error:
So if a single task gets killed because it ran out of memory while the DAG was parsing, Airflow seems to recover as expected. The tasks are set to "failed" immediately; they don't stay "queued" as described above. Next I'll try it again with more than one task. Perhaps if Task C causes the OOM kill while parsing, the timing will be right for Task B to hit the stuck-queued window, or something like that.
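For reference, a minimal sketch of the kind of DAG file involved (this is not the linked gist; the dag_id, buffer size, and operator are placeholders): allocating a large buffer at module level makes the memory spike happen while the file is being parsed, rather than while the task callable runs.

```python
# Rough sketch (not the linked gist): a DAG file that allocates a large buffer at
# import time, so the OOM happens while the worker parses the DAG, not while the
# task callable is running. In practice you'd gate the allocation (e.g. on an env
# var) so only the worker's parse blows up and not the scheduler's.
import logging
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

logging.getLogger(__name__).warning("allocating ballast during DAG parsing")

# Placeholder size; it needs to exceed the worker container's memory limit.
_ballast = bytearray(8 * 1024**3)

with DAG(
    dag_id="oom_while_parsing",  # placeholder name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
) as dag:
    PythonOperator(task_id="task_a", python_callable=lambda: print("hello"))
```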
-
I've been playing with parameters, trying to catch this. Tried many things, for instance:
Here's the updated code: https://gist.github.com/MatrixManAtYrService/6e90a3b8c7c65b8d8b1deaccc8b6f042

I still can't get it to become "stuck". Under load, the transition from queued to failed takes a bit longer, but never hours. The longest I managed to get was just over a minute. I imagine that increasing the load on the scheduler would stretch this out a bit longer, but I don't know about stretching it all the way to "stuck queued".

I have noticed something about the way the UI reports the duration. This one claims to have taken 31 seconds to fail, which is at least reasonable. This one says it took 23:59:20; I think it was probably just 00:00:40. I don't know whether this is part of our bug, or a distraction.
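For what it's worth, 23:59:20 is exactly 24 hours minus 40 seconds, which would be consistent with a roughly 40-second duration being computed backwards and wrapping around one day. A back-of-the-envelope check (my own arithmetic, not anything from the UI code):

```python
# Back-of-the-envelope check: the displayed duration equals 24h minus the plausible
# real duration, consistent with a small negative duration wrapping around a day.
from datetime import timedelta

plausible_real_duration = timedelta(seconds=40)
displayed_duration = timedelta(hours=23, minutes=59, seconds=20)
assert timedelta(hours=24) - plausible_real_duration == displayed_duration
```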
-
This is likely a good candidate for an issue: it's detailed enough and has some logs, etc. It might take time for someone to dig into why and fix it, but at least it will be more prominent than a discussion.
-
Linking issue: #28120
-
It seems what's happening is that the `airflow tasks run <task>` command is failing on the Celery worker. The Celery status is set to failed, but the task in Airflow remains in queued for some arbitrary amount of time (often hours):

Note the `state=queued` and `executor_state=failed` -- Airflow should be marking the task as failed. When this happens, these tasks also bypass `stalled_task_timeout`, because when `update_task_state` is called, the celery state is `STARTED`. `self._set_celery_pending_task_timeout(key, None)` removes the task from the list of tasks eligible for `stalled_task_timeout`, and so these tasks remain in queued indefinitely.
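As a quick way to spot these, here is a sketch (my own helper, not from the report) that lists task instances Airflow still considers queued, together with the Celery task id recorded for each, so they can be cross-checked against Celery's result backend:

```python
# Hypothetical helper (not from the original report): list task instances that are
# still "queued" in Airflow's metadata DB, plus the Celery task id recorded for each,
# so they can be compared against what Celery thinks happened.
from airflow.models import TaskInstance
from airflow.utils.session import create_session
from airflow.utils.state import TaskInstanceState

with create_session() as session:
    queued = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == TaskInstanceState.QUEUED)
        .all()
    )
    for ti in queued:
        print(ti.dag_id, ti.task_id, ti.run_id, ti.queued_dttm, ti.external_executor_id)
```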
Summary of what's happening: the `update_task_state` method calls `fail()`, which is a method from BaseExecutor. `fail` calls CeleryExecutor's `change_state` method. CeleryExecutor's `change_state` method calls BaseExecutor's `change_state` method via `super()`. BaseExecutor's `change_state` method is as follows:
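(Paraphrased from memory of the Airflow 2.x source, not copied verbatim; check your version for the exact code.)

```python
# Paraphrase of BaseExecutor.change_state (Airflow 2.x); not an exact copy.
def change_state(self, key, state, info=None):
    self.log.debug("Changing state: %s", key)
    try:
        # The task never made it into self.running because `airflow tasks run`
        # failed on the worker, so this raises KeyError...
        self.running.remove(key)
    except KeyError:
        # ...and the KeyError is swallowed here, so execution continues.
        self.log.debug("Could not find key: %s", str(key))
    self.event_buffer[key] = state, info
```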
Because the `airflow tasks run` command failed, the task is never set to the running state. The `except KeyError` block allows the code to continue unabated. Once BaseExecutor's `change_state` method completes, CeleryExecutor's `change_state` method completes: `self._set_celery_pending_task_timeout(key, None)` removes the task from the list of tasks that `stalled_task_timeout` checks for, allowing the tasks to remain in queued indefinitely.
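For completeness, here is the CeleryExecutor side, again paraphrased from memory of the 2.x-era code that had `stalled_task_timeout` (not an exact copy):

```python
# Paraphrase of CeleryExecutor.change_state (Airflow 2.x with stalled_task_timeout);
# not an exact copy.
def change_state(self, key, state, info=None):
    super().change_state(key, state, info)
    self.tasks.pop(key, None)
    # Clearing the pending-task timeout here is what drops the task from
    # stalled_task_timeout's bookkeeping, even though it never actually started.
    self._set_celery_pending_task_timeout(key, None)
```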
Instead, when the `airflow tasks run` command fails, the Airflow task instance should be failed or retried (if applicable).