Leak in Kubernetes Executor Running Tasks Slot Count #35675
Labels
area:core
kind:bug
needs-triage
Apache Airflow version
main (development)
What happened
Schedulers race for pod adoption when a scheduler's heartbeat is delayed. The scheduler in question is still alive, not dead; its heartbeat is merely delayed by a network timeout, heavy processing, etc. Because the adopted pods' entries are never removed from the original scheduler's running set, executor.running_tasks slots leak. Eventually the schedulers can no longer launch pods because executor.running_tasks reaches parallelism.
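To make the leak concrete, here is a simplified, hypothetical sketch (not the actual Airflow source) of how an executor's open slots follow from parallelism and the running set; a key that is never removed permanently consumes one slot.

```python
# Simplified, hypothetical sketch (not the actual Airflow source) of how an
# executor's open slots are computed; a key that is never removed from the
# running set permanently consumes one slot.
from typing import Set, Tuple

# (dag_id, task_id, run_id, try_number) -- illustrative key shape
TaskInstanceKey = Tuple[str, str, str, int]


class ExecutorSlots:
    def __init__(self, parallelism: int) -> None:
        self.parallelism = parallelism
        self.running: Set[TaskInstanceKey] = set()

    @property
    def open_slots(self) -> int:
        # Slots left for launching new worker pods.
        return self.parallelism - len(self.running)

    def start_task(self, key: TaskInstanceKey) -> bool:
        if self.open_slots <= 0:
            # running_tasks == parallelism: no new pods can be launched.
            return False
        self.running.add(key)
        return True

    def finish_task(self, key: TaskInstanceKey) -> None:
        # If the pod was deleted or adopted by another scheduler, this call
        # never happens on the original scheduler, so the key leaks.
        self.running.discard(key)
```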
What you think should happen instead
The entry should be removed from the Kubernetes executor's running set when the worker pod is deleted or adopted by another scheduler, so the slot is freed (see the sketch below).
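A minimal sketch of the proposed cleanup, assuming a hypothetical owner lookup and scheduler job-id comparison; the helper name, parameters, and fields are illustrative, not the actual cncf.kubernetes KubernetesExecutor API.

```python
# Hedged sketch of the proposed cleanup; the helper name, the owner lookup,
# and the job-id comparison are illustrative assumptions, not the actual
# cncf.kubernetes KubernetesExecutor API.
from typing import Dict, Optional, Set, Tuple

TaskInstanceKey = Tuple[str, str, str, int]


def release_adopted_or_deleted_slots(
    running: Set[TaskInstanceKey],
    pod_owner_by_key: Dict[TaskInstanceKey, Optional[str]],  # key -> scheduler job id, None if the pod is gone
    my_scheduler_job_id: str,
) -> None:
    """Drop keys whose worker pod was deleted or is now owned (adopted) by a
    different scheduler, so the corresponding running_tasks slot is freed."""
    for key in list(running):
        owner = pod_owner_by_key.get(key)
        if owner is None or owner != my_scheduler_job_id:
            running.discard(key)
```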
How to reproduce
Set scheduler_health_check_threshold=5 and orphaned_tasks_check_interval=10 in the Airflow config file (an example fragment is shown after these steps).
Launch Airflow with two schedulers and schedule multiple DAGs with a backfill running every 1 to 5 minutes.
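For reference, a possible airflow.cfg fragment for this reproduction; both options live in the [scheduler] section, and the values below are the reduced ones from the steps above (defaults assumed everywhere else).

```ini
[scheduler]
# Consider a scheduler unhealthy once its heartbeat is more than 5 seconds old
# (default 30), which makes the adoption race much easier to trigger.
scheduler_health_check_threshold = 5
# Check for (and adopt) orphaned tasks every 10 seconds (default 300).
orphaned_tasks_check_interval = 10
```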
Operating System
CentOS 6
Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes==7.9.0
Deployment
Other Docker-based deployment
Deployment details
Terraform
Anything else
No response
Are you willing to submit PR?
Code of Conduct