Tasks remain in queued state for a long time and executor slots are often exhausted #38968
Comments
Without any logs, errors, metrics, or details it is impossible to (1) understand your problem and (2) fix anything. Can you please provide more details?
Apologies, I'm relatively new to Airflow. We've checked the scheduler logs thoroughly, and everything seems to be functioning correctly without any errors. Additionally, the scheduler pods are operating within normal CPU and memory limits. Our database (RDS) doesn't indicate any breaches either. Currently, we're running a DAG with 150 parallel DAG runs. However, a significant portion of tasks remain in a queued state for an extended period: about 140 tasks are queued, while only 39 are actively running. I've already reviewed the configurations for max_active_tasks_per_dag and max_active_runs_per_dag, and they appear to be properly set. We did not face this issue in 2.3.3.
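For reference, the concurrency settings mentioned here live in `airflow.cfg` under the `[core]` section (or the matching `AIRFLOW__CORE__*` environment variables). A minimal sketch follows; the values are illustrative, not recommendations:

```ini
[core]
# Global cap on task instances running at once across the installation
parallelism = 32
# Max task instances allowed to run concurrently within a single DAG
max_active_tasks_per_dag = 16
# Max concurrent DAG runs per DAG
max_active_runs_per_dag = 16
```

If the global `parallelism` cap is lower than the sum of per-DAG limits actually demanded, tasks will sit in the queued state even when per-DAG settings look correct.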
Can you try increasing the …
I have updated the config map with …
@ephraimbuddy, the above config has improved task scheduling performance, and the Gantt view shows that task queue time is lower than before. Also, could you please share the performance tuning documentation? That would be really helpful.
@ephraimbuddy, also saw that the DAGs were stuck in the scheduled state; after restarting the scheduler, everything works fine now. Found that the executor was showing no open slots available; attaching an image of the metrics.
We hit the same issue twice. Same observation: it happened when executor open slots < 0.
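For context, the "open slots" metric is roughly the executor's configured parallelism minus the task instances it is already tracking. This is a simplified sketch, not the actual Airflow BaseExecutor code (which also handles unlimited parallelism and other details), but it shows how the metric can go negative:

```python
def open_slots(parallelism: int, running: int, queued: int) -> int:
    """Simplified model of an executor's open-slot calculation.

    A negative result means the executor is tracking more task
    instances than its configured parallelism allows, matching the
    "open slots < 0" observation above.
    """
    return parallelism - running - queued


# 32 configured slots, 30 running, 5 queued: oversubscribed by 3
print(open_slots(32, 30, 5))  # prints -3
```

When the executor never reports slots back as free (e.g. because completed pods are not being reaped), the count keeps sinking and the scheduler stops dispatching new work.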
@jscheffl, can you remove the pending-response label?
After reviewing various GitHub and Stack Overflow discussions, I've updated the following configuration and migrated to …
Disabled gitsync.
We have also observed that pods are not cleaned up after task completion, and all the pods are stuck in the SUCCEEDED state.
Sorry, the above comment is a false positive. We are customizing our KPO (KubernetesPodOperator) and missed adding …
@paramjeet01
This issue is related to the watcher not being able to scale and process events on time. That leads to many completed pods accumulating over time.
@dirrao, the purpose of the airflow num_runs configuration parameter was changed a while ago AFAIK, and it can no longer be used for restarting the scheduler. We have also removed run_duration, which was previously used for restarting the scheduler.
If I understood this correctly, the performance issues with tasks stuck in the queued state were mitigated by adjusting max_tis_per_query, scaling scheduler replicas, and implementing periodic scheduler restarts. @paramjeet01 tried periodically restarting all scheduler pods to temporarily resolve the issue.
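For reference, max_tis_per_query lives in the `[scheduler]` section of `airflow.cfg`. A minimal sketch, with an illustrative value rather than a recommendation:

```ini
[scheduler]
# Batch size of task instances the scheduler examines per scheduling
# query; larger batches can reduce queued time at the cost of heavier
# individual database queries (0 means "use core.parallelism")
max_tis_per_query = 512
```

Scheduler horizontal scaling is handled separately, e.g. via the `scheduler.replicas` value in the official Helm chart.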
Can anyone try this patch #40183 for the scheduler restarting issue? |
Airflow 2.10.3 is now out, and it includes fix #42932, which is likely to address the problems you reported. Please upgrade, check whether it fixed your problem, and report back, @paramjeet01?
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.8.3
What happened?
Tasks remain in the queued state for longer than expected. This worked perfectly in 2.3.3.
What you think should happen instead?
Tasks should be in the running state instead of being queued.
How to reproduce
Spin up more than 150 DAG runs in parallel: in Airflow 2.8.3 the tasks get queued, whereas they executed fine in 2.3.3.
Operating System
Amazon Linux 2
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
Code of Conduct