Celery tasks stuck in queued state after worker crash (Set changed size during iteration) #29833
Comments
Can you extract the ti state from the DB for a task stuck in queued state?

```sql
SELECT *
FROM task_instance
WHERE dag_id='<dag id>' AND task_id='<task id>' AND execution_date='<execution date>';
```
@hussein-awala here is the db extract
I think this can happen when the Airflow worker completely loses access to the metadata database. In that case it cannot update the task state, and if we use the same database as the result backend for Celery commands, then the Celery workers will also be unable to send events to the DB that the scheduler processes to re-queue the failed tasks. I'm not sure whether we check for worker heartbeats; I didn't find such a check in the method.
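For illustration, a minimal airflow.cfg sketch of the coupling described above, where the Celery result backend shares the metadata database (the connection strings are placeholders, not taken from this report):

```ini
[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres/airflow

[celery]
# Sharing the metadata database as the Celery result backend means a loss
# of DB connectivity stops workers both from updating task_instance state
# and from reporting Celery task results.
result_backend = db+postgresql://airflow:airflow@postgres/airflow
```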
I am not THAT knowledgeable about this part, so take it with a grain of salt, but let me explain how I understand what's going on. @ephraimbuddy @ashb - maybe you can take a look and confirm whether my understanding is correct? Everything related to managing Celery task state happens in the Celery executor. If a task does not update its state (for example because the worker crashed), it should eventually be rescheduled as stalled (by its own executor) or adopted (by another one); see the config sketch below. That's as much detail as I know off the top of my head. There are a few race conditions that might occur (no distributed system is ever foolproof), but I think the original design intent is that even if a very nasty race condition happens, the tasks will eventually be rescheduled.
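A minimal airflow.cfg sketch of the timeouts that govern this stalled/adopted behavior in the 2.5 line, assuming the [celery] task_adoption_timeout and stalled_task_timeout options; the values are illustrative, not this deployment's configuration:

```ini
[celery]
# How long (seconds) a task may go without an executor heartbeat before
# another scheduler's executor is allowed to adopt it.
task_adoption_timeout = 600
# How long (seconds) a task may sit queued in Celery with no progress
# before it is treated as stalled and sent back for rescheduling
# (0 disables the check).
stalled_task_timeout = 300
```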
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. We kindly ask you to recheck the report against the latest Airflow version and let us know if the issue is still reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
Apache Airflow version
2.5.0
What happened
RuntimeError: Set changed size during iteration
The following log and screenshot show a more recent example of the situation above, where step 5 has not yet reached 14h. We have observed a few such situations recently.
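For context on the traceback line above: this is Python's standard error whenever a set is mutated while it is being iterated, independent of anything Airflow-specific. A minimal standalone reproduction:

```python
# Adding to a set while iterating over it raises
# "RuntimeError: Set changed size during iteration".
s = {1, 2, 3}
for item in s:
    s.add(item + 10)
```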
What you think should happen instead
A. The Celery worker should reconnect to the database in case of intermittent network errors (see the sketch after this list).
B. In case of unrecoverable errors, the scheduler should eventually retry or fail the task.
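A minimal sketch of what expectation A could look like, assuming a tenacity-style retry around the state update; update_task_state is a hypothetical helper for illustration, not an Airflow internal:

```python
# Hypothetical sketch: retry transient DB errors with exponential backoff
# instead of leaving the task instance stuck in the queued state.
from sqlalchemy.exc import OperationalError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type(OperationalError),  # transient network/DB errors
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, max=30),
    reraise=True,  # give up and surface the error once retries are exhausted
)
def update_task_state(session, task_instance, state):
    """Persist a task-instance state change, retrying transient failures."""
    task_instance.state = state
    session.merge(task_instance)
    session.commit()
```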
How to reproduce
Difficult, happens intermittently.
Operating System
Ubuntu 22.04
Versions of Apache Airflow Providers
apache-airflow==2.5.0
apache-airflow-providers-celery==3.1.0
redis==3.5.3
celery==5.2.7
kombu==5.2.4
Deployment
Other Docker-based deployment
Deployment details
airflow.cfg, celery options
Anything else
No response
Are you willing to submit PR?
Code of Conduct