Improve clear_not_launched_queued_tasks call duration #34985

dirrao · 2023-10-17T01:52:13Z

Problem: Airflow runs the clear_not_launched_queued_tasks function at a fixed interval (default 30 seconds). The function loops through each queued task and checks whether the corresponding worker pod exists in the Kube cluster. Right now, each existence check is a separate list pods Kube API call. On a large Kube cluster (more than 5K pods), each call takes more than 1s, so 120 queued tasks take about 120 seconds (1s * 120). The scheduler then spends most of its time in this function rather than scheduling tasks, which leads to degraded scheduler performance or to no jobs being scheduled at all.

Solution: Use a single batched k8s list pods API call to fetch all the worker pods owned by the scheduler. Build a set of searchable strings from the pod labels, then use this set to check whether the pod associated with each task exists.

set elements string format:
(dag_id=<dag_id>,task_id=<task_id>,airflow-worker=[,map_index=<map_index>],[run_id=<run_id>]|[execution_date=<execution_date>])
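To illustrate the idea, here is a minimal sketch of the set-based lookup, not the code from this PR. The helper names `pod_search_key` and `find_tasks_without_pods` are hypothetical, pod labels are modeled as plain dicts rather than Kubernetes client objects, and the sketch assumes the `airflow-worker` label carries some worker identifier (its placeholder is missing from the format above).

```python
from typing import Iterable, List, Mapping


def pod_search_key(labels: Mapping[str, str]) -> str:
    """Build one searchable string from a set of labels (hypothetical helper).

    Labels that are not part of the key format (e.g. try_number) are ignored.
    """
    parts = [
        f"dag_id={labels['dag_id']}",
        f"task_id={labels['task_id']}",
        f"airflow-worker={labels['airflow-worker']}",
    ]
    if "map_index" in labels:
        parts.append(f"map_index={labels['map_index']}")
    # Either run_id or execution_date identifies the run, never both.
    if "run_id" in labels:
        parts.append(f"run_id={labels['run_id']}")
    elif "execution_date" in labels:
        parts.append(f"execution_date={labels['execution_date']}")
    return ",".join(parts)


def find_tasks_without_pods(
    queued_task_labels: Iterable[Mapping[str, str]],
    pod_labels: Iterable[Mapping[str, str]],
) -> List[Mapping[str, str]]:
    """Return the queued tasks that have no matching worker pod."""
    # Built once from a single batched pod listing, then reused for every task.
    existing = {pod_search_key(labels) for labels in pod_labels}
    return [t for t in queued_task_labels if pod_search_key(t) not in existing]


# Example: one pod exists, two tasks are queued (all values are made up).
pods = [
    {"dag_id": "etl", "task_id": "load", "airflow-worker": "42",
     "run_id": "manual_1", "try_number": "1"},
]
queued = [
    {"dag_id": "etl", "task_id": "load", "airflow-worker": "42", "run_id": "manual_1"},
    {"dag_id": "etl", "task_id": "extract", "airflow-worker": "42", "run_id": "manual_1"},
]
missing = find_tasks_without_pods(queued, pods)
```

With this shape, the cost is one pod listing plus one set lookup per queued task, instead of one API call per queued task.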

The details of the issue are described in ticket #34877.

Closes: #34877

uranusjr · 2023-10-18T04:19:59Z

Please consider reformatting the commit message so it fits the standard style.

https://cbea.ms/git-commit/#seven-rules


hussein-awala

This PR will slightly increase memory usage, especially if parallelism is set to a high value, but it will have a significant positive impact on scheduling time and scheduler performance.

Overall it looks good, but I would prefer to test it first; I will try to do that ASAP.

dirrao · 2023-10-22T02:31:30Z

@hussein-awala Did you get a chance to test it? Any findings?


hussein-awala

I re-checked the code, and I tested the change with Breeze and with a production deployment with 32 pods per scheduler; all looks good.

dirrao · 2023-10-28T00:40:52Z

@uranusjr / @jedcunningham,
can you review and merge it?


boring-cyborg · 2023-11-01T09:36:21Z

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

* Improve clear_not_launched_queued_tasks call duration
* Apply suggestions from code review

Co-authored-by: Tzu-ping Chung <[email protected]>

---------

Co-authored-by: gopal <[email protected]>
Co-authored-by: Tzu-ping Chung <[email protected]>
Co-authored-by: Hussein Awala <[email protected]>
Co-authored-by: Elad Kalif <[email protected]>

dirrao requested review from jedcunningham and hussein-awala as code owners October 17, 2023 01:52

boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes provider related issues labels Oct 17, 2023

dirrao force-pushed the 34877-clear_not_launched_queued_tasks_optimization branch from 72d3669 to 72f5f89 Compare October 18, 2023 04:38

uranusjr changed the title ~~clear_not_launched_queued_tasks optimization using kube list pods bat…~~ Improve clear_not_launched_queued_tasks call duration Oct 18, 2023

uranusjr reviewed Oct 18, 2023

airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py (three review comments, outdated and resolved)

dirrao requested a review from uranusjr October 18, 2023 06:19

gopal and others added 2 commits October 18, 2023 10:13

Improve clear_not_launched_queued_tasks call duration

7c887f9

Apply suggestions from code review

144f9d1

Co-authored-by: Tzu-ping Chung <[email protected]>

potiuk force-pushed the 34877-clear_not_launched_queued_tasks_optimization branch from ed4ca77 to 144f9d1 Compare October 18, 2023 08:13

hussein-awala reviewed Oct 18, 2023


dirrao requested a review from hussein-awala October 24, 2023 16:18

Merge branch 'main' into 34877-clear_not_launched_queued_tasks_optimization

edbd7eb

hussein-awala approved these changes Oct 27, 2023


hussein-awala requested a review from eladkal October 31, 2023 08:27

eladkal approved these changes Nov 1, 2023


Merge branch 'main' into 34877-clear_not_launched_queued_tasks_optimization

15e79c8

eladkal merged commit 3724a02 into apache:main Nov 1, 2023
50 checks passed

eladkal mentioned this pull request Nov 8, 2023

Status of testing Providers that were prepared on November 08, 2023 #35540

Closed

60 tasks
