Improve clear_not_launched_queued_tasks call duration #34985
Please consider reformatting the commit message so it fits the standard style.
72d3669
to
72f5f89
Compare
airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py
Co-authored-by: Tzu-ping Chung <[email protected]>
ed4ca77
to
144f9d1
Compare
This PR will slightly increase memory usage, especially if parallelism is set to a high value, but it will have a significant positive impact on scheduling time and scheduler performance.
Overall it looks good, but I would prefer to test it first. I will try to do that ASAP.
@hussein-awala Did you get a chance to test it? Any findings?
I re-checked the code, and I tested the change with Breeze and with a production deployment with 32 pods per scheduler; all looks good.
@uranusjr / @jedcunningham, |
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.
* Improve clear_not_launched_queued_tasks call duration
* Apply suggestions from code review
Co-authored-by: Tzu-ping Chung <[email protected]>
---------
Co-authored-by: gopal <[email protected]>
Co-authored-by: Tzu-ping Chung <[email protected]>
Co-authored-by: Hussein Awala <[email protected]>
Co-authored-by: Elad Kalif <[email protected]>
Problem: Airflow runs the clear_not_launched_queued_tasks function at a fixed interval (30 seconds by default). On a large Kubernetes cluster (more than 5K pods), this call becomes slow. Internally, the function loops over each queued task and checks whether the corresponding worker pod exists in the cluster. Currently, this existence check uses the Kubernetes list-pods API, and each call takes more than 1 s. With 120 queued tasks, the function therefore takes about 120 seconds (1 s * 120). As a result, the scheduler spends most of its time in this function rather than scheduling tasks, so scheduler performance degrades or no jobs get scheduled at all.
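To make the arithmetic above concrete, here is a rough sketch of the cost model, using only the figures quoted in the problem statement (1 s per list call, 120 queued tasks, 30 s interval). The function name is hypothetical and the model ignores everything except the API round trips.

```python
def sequential_check_seconds(queued_tasks: int, seconds_per_list_call: float) -> float:
    """Estimate the time spent when every queued task triggers its own list-pods call."""
    return queued_tasks * seconds_per_list_call


# Figures taken from the problem statement above.
check_interval_seconds = 30
total_seconds = sequential_check_seconds(queued_tasks=120, seconds_per_list_call=1.0)

# How many check intervals a single pass of the function consumes.
intervals_consumed = total_seconds / check_interval_seconds

print(f"one pass takes about {total_seconds:.0f} s, "
      f"i.e. {intervals_consumed:.0f} check intervals")
```

Under these assumptions a single pass outlasts the interval between passes several times over, which is why the scheduler has little time left for scheduling.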
Solution: Use a single batched Kubernetes list-pods API call to fetch all worker pods owned by the scheduler. Build a set of searchable strings from the pod labels, then use that set to check whether the pod associated with each queued task exists.
Format of each string in the set:
(dag_id=<dag_id>,task_id=<task_id>,airflow-worker=[,map_index=<map_index>],[run_id=<run_id>]|[execution_date=<execution_date>])
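The set-based lookup can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: pods and queued tasks are modelled as plain label dictionaries instead of Kubernetes client objects, the helper names `label_key` and `find_tasks_without_pod` are hypothetical, and the `airflow-worker` value is assumed to identify the owning scheduler.

```python
from typing import Iterable, List, Mapping


def label_key(labels: Mapping[str, str]) -> str:
    """Build one searchable string from a set of labels (hypothetical helper).

    The field order follows the string format described above: map_index is
    optional, and run_id is used when present, otherwise execution_date.
    """
    parts = [
        f"dag_id={labels['dag_id']}",
        f"task_id={labels['task_id']}",
        f"airflow-worker={labels['airflow-worker']}",
    ]
    if "map_index" in labels:
        parts.append(f"map_index={labels['map_index']}")
    if "run_id" in labels:
        parts.append(f"run_id={labels['run_id']}")
    elif "execution_date" in labels:
        parts.append(f"execution_date={labels['execution_date']}")
    return ",".join(parts)


def find_tasks_without_pod(
    queued_tasks: Iterable[Mapping[str, str]],
    pod_labels: Iterable[Mapping[str, str]],
) -> List[Mapping[str, str]]:
    """Return the queued tasks that have no matching worker pod.

    pod_labels stands in for the result of one batched list-pods call;
    each membership test against the set is O(1) on average.
    """
    existing = {label_key(labels) for labels in pod_labels}
    return [task for task in queued_tasks if label_key(task) not in existing]


# Illustrative data only: two pods exist, three tasks are queued.
pods = [
    {"dag_id": "etl", "task_id": "extract", "airflow-worker": "7", "run_id": "r1"},
    {"dag_id": "etl", "task_id": "load", "airflow-worker": "7",
     "map_index": "0", "run_id": "r1"},
]
tasks = pods + [
    {"dag_id": "etl", "task_id": "transform", "airflow-worker": "7", "run_id": "r1"},
]

missing = find_tasks_without_pod(tasks, pods)
print([t["task_id"] for t in missing])
```

With this shape, the number of API calls no longer grows with the number of queued tasks; the trade-off, as noted in the review, is holding the label strings for all worker pods in memory.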
The details of the issue are described in ticket #34877.
Closes: #34877