Leak in Kubernetes Executor Running Tasks Slot Count #35675

Closed
2 tasks done
dirrao opened this issue Nov 16, 2023 · 2 comments · Fixed by #36240
Labels
area:core, kind:bug, needs-triage

Comments

@dirrao
Contributor

dirrao commented Nov 16, 2023

Apache Airflow version

main (development)

What happened

Schedulers race to adopt each other's pods when a scheduler's heartbeat is delayed. The affected schedulers are alive, not dead; their heartbeat is only late because of a network timeout, heavy processing, etc. This leaks executor.running_tasks slots: the original scheduler keeps counting tasks whose pods another scheduler has adopted. Eventually, the schedulers cannot launch new pods because executor.running_tasks=parallelism.
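To illustrate the leak described above, here is a minimal toy model. It is a sketch under assumptions, not Airflow's actual executor code: `ToyExecutor`, `simulate_leak`, and the task keys are hypothetical names, and open slots are assumed to be parallelism minus the number of tracked running tasks.

```python
from dataclasses import dataclass, field


@dataclass
class ToyExecutor:
    """Toy stand-in for one scheduler's executor; not Airflow's real class."""

    parallelism: int
    running: set = field(default_factory=set)

    @property
    def open_slots(self) -> int:
        # Assumption: free slots = parallelism minus tracked running tasks.
        return self.parallelism - len(self.running)

    def launch(self, task_key: str) -> bool:
        """Start tracking a task if a slot is free; report whether it launched."""
        if self.open_slots <= 0:
            return False
        self.running.add(task_key)
        return True


def simulate_leak(parallelism: int = 2) -> tuple[int, bool]:
    """Return scheduler A's open slots and whether it can still launch a task."""
    scheduler_a = ToyExecutor(parallelism)
    scheduler_b = ToyExecutor(parallelism)
    for i in range(parallelism):
        key = f"task-{i}"
        scheduler_a.launch(key)
        # A's heartbeat is late, so B treats A as dead and adopts the pod.
        scheduler_b.running.add(key)
        # The task finishes under B, and only B clears its entry.
        scheduler_b.running.discard(key)
        # A is alive but never learns the pod moved, so A's entry stays.
    return scheduler_a.open_slots, scheduler_a.launch("new-task")
```

In this model, scheduler A ends with zero open slots even though none of its tasks are still running, so every further launch attempt is refused.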

What you think should happen instead

We should remove the entry from the Kubernetes executor's running queue when the worker pod is deleted or moved to another scheduler.
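The expected behavior could look roughly like the sketch below. This is not the actual fix: `SlotTracker`, `on_launch`, and `on_pod_gone` are hypothetical names used only to show the idea of releasing the slot whenever the pod is deleted or adopted elsewhere.

```python
class SlotTracker:
    """Hypothetical tracker for one scheduler's running tasks."""

    def __init__(self, parallelism: int) -> None:
        self.parallelism = parallelism
        self.running: set[str] = set()

    @property
    def open_slots(self) -> int:
        return self.parallelism - len(self.running)

    def on_launch(self, task_key: str) -> None:
        self.running.add(task_key)

    def on_pod_gone(self, task_key: str) -> None:
        """Release the slot when the pod is deleted or adopted by another scheduler."""
        # discard() ignores unknown keys, so duplicate events are harmless.
        self.running.discard(task_key)


def demo_release() -> list[int]:
    """Record open slots after launching two tasks and releasing each one."""
    tracker = SlotTracker(parallelism=2)
    tracker.on_launch("task-0")
    tracker.on_launch("task-1")
    slots = [tracker.open_slots]
    tracker.on_pod_gone("task-0")  # pod deleted
    slots.append(tracker.open_slots)
    tracker.on_pod_gone("task-1")  # pod adopted by another scheduler
    tracker.on_pod_gone("task-1")  # duplicate event, no effect
    slots.append(tracker.open_slots)
    return slots
```

Using `discard` rather than `remove` keeps the handler idempotent, which matters if both a delete event and an adoption event arrive for the same pod.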

How to reproduce

Reduce the values to scheduler_health_check_threshold=5 and orphaned_tasks_check_interval=10 in the Airflow config file.
Launch Airflow with two schedulers and schedule multiple DAGs with backfill, each running every 1/5 mins.
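For deployments configured through environment variables rather than the config file, the two settings above can be expressed with the same `AIRFLOW__{SECTION}__{KEY}` naming used elsewhere in this thread. The sketch below assumes both options live in the `[scheduler]` section; `airflow_env_var` and `repro_environment` are hypothetical helper names.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build an environment variable name in the AIRFLOW__{SECTION}__{KEY} form."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Assumption: both options belong to the [scheduler] section.
REPRO_SETTINGS = {
    ("scheduler", "scheduler_health_check_threshold"): "5",
    ("scheduler", "orphaned_tasks_check_interval"): "10",
}


def repro_environment() -> dict[str, str]:
    """Return the environment variables for the reproduction settings."""
    return {
        airflow_env_var(section, key): value
        for (section, key), value in REPRO_SETTINGS.items()
    }


if __name__ == "__main__":
    for name, value in repro_environment().items():
        print(f"{name}={value}")
```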

Operating System

CentOS 6

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes=7.9.0

Deployment

Other Docker-based deployment

Deployment details

Terraform

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

dirrao added the area:core, kind:bug, and needs-triage labels Nov 16, 2023
dirrao changed the title from "Leak in Kubernetes Executor Running Tasks SLOT Count" to "Leak in Kubernetes Executor Running Tasks Slot Count" Nov 16, 2023
@c-thiel
Contributor

c-thiel commented Dec 6, 2023

We had the same problem. We only noticed it starting with Airflow 2.7.3, but it was already there before.
Maybe this helps for debugging:

On the left of the red line we had Airflow 2.6.X
On the right side is the update to 2.7.3.

Each line is one scheduler.

[image: graph of open slots per scheduler, before and after the update]

Before the update, the schedulers' open slots decreased continuously, sometimes even below zero, but only for some executors.
After the update, the decrease is spread much more evenly across all the executors.

Our Workaround
What helped us was limiting parallelism via pools and setting AIRFLOW__CORE__PARALLELISM to an arbitrarily high number. Regular deployment restarts also help.
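A rough way to see why this workaround helps, as a sketch with hypothetical numbers: `launchable_tasks` is an illustrative function, not an Airflow API, and it assumes the number of tasks that can start is bounded by both the executor's remaining slots and the pool's open slots.

```python
def launchable_tasks(parallelism: int, leaked_slots: int, pool_open_slots: int) -> int:
    """Estimate how many tasks can start, given leaked executor slots.

    Assumption: a task needs both a free executor slot and a free pool slot.
    """
    executor_open = parallelism - leaked_slots
    return max(0, min(executor_open, pool_open_slots))


# With a modest parallelism, leaked slots eventually block everything.
blocked = launchable_tasks(parallelism=32, leaked_slots=32, pool_open_slots=16)

# With a very high parallelism, the pool becomes the effective limit.
unblocked = launchable_tasks(parallelism=10_000, leaked_slots=32, pool_open_slots=16)
```

The leak itself is still there; a very high parallelism only delays the point where the leaked entries use up all executor slots, which is why regular restarts are still needed.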

@dirrao
Contributor Author

dirrao commented Dec 6, 2023

@c-thiel
Yes, this workaround works, and we are using it as well. I have identified the leaks and am going to open a PR soon.

2 participants