Improve clear_not_launched_queued_tasks call duration #34985

Merged

Conversation

dirrao
Contributor

@dirrao dirrao commented Oct 17, 2023

Problem: Airflow runs the clear_not_launched_queued_tasks function at a fixed interval (30 seconds by default). This becomes a bottleneck when Airflow runs on a large Kube cluster (more than 5K pods). Internally, clear_not_launched_queued_tasks loops through each queued task and checks whether the corresponding worker pod exists in the Kube cluster. Right now, each existence check is a separate list pods Kube API call, and each call takes more than 1s. With 120 queued tasks, a single pass takes ~120 seconds (1s * 120). As a result, the scheduler spends most of its time in this function rather than scheduling tasks, so either no jobs get scheduled or scheduler performance degrades.
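To make the arithmetic above concrete, here is a minimal sketch of the cost model. The function names are hypothetical and exist only for illustration; the numbers (120 queued tasks, about 1s per list pods call, a 30-second interval) are the ones quoted in the description.

```python
def check_pass_seconds(queued_tasks: int, seconds_per_list_call: float) -> float:
    """Time for one pass when every queued task triggers its own list pods call."""
    return queued_tasks * seconds_per_list_call


def passes_worth_of_interval(pass_seconds: float, interval_seconds: float) -> float:
    """How many check intervals a single pass consumes."""
    return pass_seconds / interval_seconds


pass_seconds = check_pass_seconds(queued_tasks=120, seconds_per_list_call=1.0)
ratio = passes_worth_of_interval(pass_seconds, interval_seconds=30.0)
```

With these inputs a single pass outlasts the interval that triggers it several times over, which matches the reported symptom of the scheduler being stuck in this function.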

Solution: Use a single batch list pods Kube API call to fetch all worker pods owned by the scheduler. Build a set of searchable strings from the pod labels, then use this set to determine whether the pod associated with each task exists.
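The difference between the two shapes can be sketched as follows. This is not the code from the PR: FakePodApi is a hypothetical in-memory stand-in for the Kube list pods API that only counts calls, pods and tasks are plain label dicts, and task_key is a deliberately simplified searchable string that uses only dag_id and task_id.

```python
from typing import Dict, List, Set

Labels = Dict[str, str]


class FakePodApi:
    """Hypothetical stand-in for the Kube list pods API; it counts calls."""

    def __init__(self, pods: List[Labels]) -> None:
        self._pods = pods
        self.calls = 0

    def list_pods(self, selector: Labels) -> List[Labels]:
        """Return the label dicts of pods matching every selector entry."""
        self.calls += 1
        return [
            pod for pod in self._pods
            if all(pod.get(key) == value for key, value in selector.items())
        ]


def task_key(labels: Labels) -> str:
    """Simplified searchable string (assumption: dag_id and task_id only)."""
    return f"dag_id={labels['dag_id']},task_id={labels['task_id']}"


def missing_per_task(api: FakePodApi, tasks: List[Labels], worker: str) -> List[Labels]:
    """Per-task shape: one list call for every queued task."""
    return [
        task for task in tasks
        if not api.list_pods({**task, "airflow-worker": worker})
    ]


def missing_batched(api: FakePodApi, tasks: List[Labels], worker: str) -> List[Labels]:
    """Batched shape: one list call, then a set membership test per task."""
    existing: Set[str] = {
        task_key(pod) for pod in api.list_pods({"airflow-worker": worker})
    }
    return [task for task in tasks if task_key(task) not in existing]
```

Both functions return the same tasks as missing, but the per-task version issues one API call per queued task while the batched version issues exactly one, at the cost of holding the key set in memory.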

set elements string format:
(dag_id=<dag_id>,task_id=<task_id>,airflow-worker=[,map_index=<map_index>],[run_id=<run_id>]|[execution_date=<execution_date>])
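As an illustration of the format above, a key builder might look like the sketch below. The function name build_search_key is hypothetical, and two details are assumptions rather than facts from this PR: the value after airflow-worker= is blank in the description, so the sketch uses the pod's airflow-worker label value, and the optional parts are joined with commas.

```python
from typing import Mapping


def build_search_key(labels: Mapping[str, str]) -> str:
    """Build one searchable string from a pod's (or task's) labels."""
    parts = [
        f"dag_id={labels['dag_id']}",
        f"task_id={labels['task_id']}",
        # Assumption: the elided value is the airflow-worker label itself.
        f"airflow-worker={labels['airflow-worker']}",
    ]
    # map_index is optional and only present for mapped tasks.
    if "map_index" in labels:
        parts.append(f"map_index={labels['map_index']}")
    # Exactly one of run_id or execution_date is appended, run_id first.
    if "run_id" in labels:
        parts.append(f"run_id={labels['run_id']}")
    elif "execution_date" in labels:
        parts.append(f"execution_date={labels['execution_date']}")
    return ",".join(parts)


example_key = build_search_key(
    {"dag_id": "etl", "task_id": "load", "airflow-worker": "42", "run_id": "manual_1"}
)
```

Because the same builder is applied to pod labels and to queued task attributes, a task's pod exists exactly when the task's key is a member of the set built from the pods.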

The issue is described in detail in ticket #34877.

Closes: #34877

@uranusjr
Member

Please consider reformatting the commit message so it fits the standard style.

https://cbea.ms/git-commit/#seven-rules

@dirrao dirrao force-pushed the 34877-clear_not_launched_queued_tasks_optimization branch from 72d3669 to 72f5f89 Compare October 18, 2023 04:38
@uranusjr uranusjr changed the title clear_not_launched_queued_tasks optimization using kube list pods bat… Improve clear_not_launched_queued_tasks call duration Oct 18, 2023
@dirrao dirrao requested a review from uranusjr October 18, 2023 06:19
@potiuk potiuk force-pushed the 34877-clear_not_launched_queued_tasks_optimization branch from ed4ca77 to 144f9d1 Compare October 18, 2023 08:13
Member

@hussein-awala hussein-awala left a comment

This PR will slightly increase memory usage, especially if parallelism is set to a high value, but it will have a significant positive impact on scheduling time and scheduler performance.

Overall it looks good, but I would prefer to test it first; I will try to do that ASAP.

@dirrao
Contributor Author

dirrao commented Oct 22, 2023

@hussein-awala Did you get a chance to test it? Any findings?

@dirrao dirrao requested a review from hussein-awala October 24, 2023 16:18
Member

@hussein-awala hussein-awala left a comment

I re-checked the code, and I tested the change with Breeze and with a production deployment with 32 pods per scheduler; all looks good.

@dirrao
Contributor Author

dirrao commented Oct 28, 2023

@uranusjr / @jedcunningham,
can you review and merge it?

@hussein-awala hussein-awala requested a review from eladkal October 31, 2023 08:27
@eladkal eladkal merged commit 3724a02 into apache:main Nov 1, 2023
50 checks passed

boring-cyborg bot commented Nov 1, 2023

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Nov 10, 2023
* Improve clear_not_launched_queued_tasks call duration

* Apply suggestions from code review

Co-authored-by: Tzu-ping Chung <[email protected]>

---------

Co-authored-by: gopal <[email protected]>
Co-authored-by: Tzu-ping Chung <[email protected]>
Co-authored-by: Hussein Awala <[email protected]>
Co-authored-by: Elad Kalif <[email protected]>
Labels
area:providers, provider:cncf-kubernetes (Kubernetes provider related issues)
Development

Successfully merging this pull request may close these issues.

Scheduler is spending most of its time in clear_not_launched_queued_tasks function
4 participants