
Tasks stuck in queued state #28206

Closed
2 tasks done
benrifkind opened this issue Dec 7, 2022 · 5 comments
Labels
area:core kind:bug
Milestone
Airflow 2.5.1

Comments

@benrifkind

benrifkind commented Dec 7, 2022

Apache Airflow version

2.5.0

What happened

Tasks are getting stuck in the queued state

What you think should happen instead

Tasks should get scheduled and run

How to reproduce

I am using the CeleryExecutor and deploying Airflow on AWS's EKS.

I have 3 DAGs with 10 tasks. Each task is a simple KubernetesPodOperator which just exits as soon as it starts. If I deploy Airflow with CELERY__WORKER_CONCURRENCY set to something high like 32, the Celery worker fails and the tasks that were queued to run on it enter a bad state. Even after I lower the concurrency (to 16), those tasks still do not get scheduled. Note that if I set the worker concurrency to 16 on the initial deploy, the tasks never get into a bad state and everything works fine.
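
For concreteness, here is a minimal sketch of the kind of DAG involved (the namespace, image, and schedule are illustrative guesses rather than my exact DAG; the dag_id/task_id naming follows the scheduler log further down):

```python
# Illustrative sketch only - not the exact production DAG.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="batch_1",
    start_date=datetime(2022, 1, 1),
    schedule="5 4 * * *",  # guessed from the scheduled__...T04:05:00 run_id below
    catchup=False,
) as dag:
    for i in range(10):
        KubernetesPodOperator(
            task_id=f"task_{i}",
            name=f"task-{i}",
            namespace="airflow",      # illustrative namespace
            image="busybox:latest",   # any image whose command exits immediately
            cmds=["true"],
        )
```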

Clearing the tasks does not fix the issue either. I get this log line in the scheduler:

ERROR - could not queue task TaskInstanceKey(dag_id='batch_1', task_id='task_2', run_id='scheduled__2022-01-07T04:05:00+00:00', try_number=1, map_index=-1) (still running after 4 attempts)

To me it seems like the scheduler thinks the task is still running even though it is not.

Clearing the task and restarting the scheduler seems to do the trick.

Happy to give any more information that would be needed. Tasks getting stuck in queued also sometimes happens in my production environment which is the impetus for this investigation. I'm not sure if it is the same problem but I would like to figure out if this is a bug or just a misconfiguration on my end. Thanks for your help.

Operating System

Debian GNU/Linux

Versions of Apache Airflow Providers

apache-airflow-providers-celery==3.1.0
apache-airflow-providers-cncf-kubernetes==5.0.0

Deployment

Other 3rd-party Helm chart

Deployment details

I am using this Helm chart to deploy: https://github.com/airflow-helm/charts/tree/main/charts/airflow (v8.6.1)

I know that chart is not supported by Apache Airflow, but I don't think the issue is related to the chart. Based on the logs and the workaround, it seems like an issue with Airflow/Celery.

Anything else

This problem can be replicated every time by following the steps I detailed above. I am not sure whether the way the Celery worker fails matters.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@benrifkind benrifkind added the area:core and kind:bug labels Dec 7, 2022
@boring-cyborg

boring-cyborg bot commented Dec 7, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@uranusjr
Member

uranusjr commented Dec 8, 2022

Since it seems the issue can be consistently replicated, have you tried to reduce and localise the source of the issue? For example, does this happen with other operators, or is it specific to KubernetesPodOperator? Fewer DAGs and/or fewer tasks? On another machine (e.g. your local PC with LocalExecutor)?

@benrifkind
Author

benrifkind commented Dec 8, 2022

Hi @uranusjr. Thanks for your response.

I believe this is an issue with the CeleryExecutor, so I have not tested it with any other executors.

I checked and this doesn't seem to be an issue specifically with KubernetesPodOperator. I was able to replicate it with the BashOperator.

In terms of the number of DAGs and tasks, I was able to replicate this with one DAG with many tasks. I think the issue occurs when a Celery worker goes down unexpectedly while it is still responsible for running tasks. So with one Celery worker, running one DAG with a lot of tasks and high concurrency creates the problem. Basically, the Celery worker dies or is killed, and once it comes back up the scheduler still thinks the tasks are being run, so it can't rerun them on the restarted worker. Of course, I'm not sure that is what is happening, but it's my best guess. I am also not sure why restarting the scheduler after clearing the tasks fixes the issue.
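
To make that concrete, here is a minimal, illustrative sketch of the simpler reproduction (one DAG, many short tasks, a single Celery worker with high concurrency; the dag_id and task count are made up, not my exact DAG):

```python
# Illustrative sketch only: one DAG with many trivial tasks. Trigger it, then
# kill the single Celery worker while tasks are queued/running to reproduce
# the stuck "queued" state described above.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="stuck_queued_repro",  # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule=None,  # trigger manually
    catchup=False,
) as dag:
    for i in range(50):
        BashOperator(task_id=f"task_{i}", bash_command="exit 0")
```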

Thanks for your help.

Edit: More context and supporting information.

I just ran into this problem in my production deployment and was able to fix it by clearing the tasks and restarting the scheduler. We restart our entire Airflow deployment once a week by simply terminating the instances that our EKS deployment runs on. This terminates the Airflow Celery workers without waiting for them to shut down gracefully, so this production problem, along with its workaround, likely matches what I documented above.

@potiuk potiuk added this to the Airflow 2.5.1 milestone Dec 19, 2022
@potiuk
Member

potiuk commented Dec 19, 2022

Something to be investigated then - it looks like a problem in handling such an exceptional case with Celery. Thanks for narrowing it down. In the meantime - are you sure you want to shut down the deployment that abruptly?

@benrifkind
Author

Yup, that's a good point. This is not a good way to shut down Airflow. I will try to move away from this approach to restarting the deployment. Thanks for the feedback.

@apache apache locked and limited conversation to collaborators Jan 4, 2023
@potiuk potiuk converted this issue into discussion #28714 Jan 4, 2023

