Leak in Kubernetes Executor Running Tasks Slot Count #35675

dirrao · 2023-11-16T08:09:00Z

Apache Airflow version

main (development)

What happened

Schedulers race for pod adoption when a scheduler's heartbeat is delayed. The delayed scheduler is still alive, not dead; its heartbeat is only late because of a network timeout, heavy processing, etc. This leads to a leak in the executor.running_tasks slots. Eventually, the schedulers are unable to launch pods because executor.running_tasks = parallelism.

What you think should happen instead

We should remove the entry from the Kubernetes executor's running queue when the worker pod is deleted or moved to another scheduler.
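
The leak and the proposed behaviour can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyExecutor`, `adopt_from`, and `simulate` are hypothetical names made up for illustration, not the KubernetesExecutor API. The only thing taken from the report is that a scheduler stops launching pods once its running count reaches parallelism.

```python
class ToyExecutor:
    """Hypothetical stand-in for an executor's slot bookkeeping (not Airflow code)."""

    def __init__(self, parallelism):
        self.parallelism = parallelism
        self.running = set()  # keys of tasks this executor believes it is running

    @property
    def open_slots(self):
        return self.parallelism - len(self.running)

    def launch(self, task_key):
        # Refuse to launch when no slot is free, as described in the report.
        if self.open_slots <= 0:
            return False
        self.running.add(task_key)
        return True

    def adopt_from(self, other, task_key, release_on_handover):
        # This executor takes over a task that `other` launched.
        self.running.add(task_key)
        if release_on_handover:
            # Proposed behaviour: the previous owner drops its entry.
            other.running.discard(task_key)


def simulate(release_on_handover):
    """Return the open slots left on the scheduler whose heartbeat was late."""
    late = ToyExecutor(parallelism=2)
    peer = ToyExecutor(parallelism=2)
    for key in ("t1", "t2"):
        late.launch(key)
    # The late scheduler is still alive, but the peer adopts its pods anyway.
    for key in ("t1", "t2"):
        peer.adopt_from(late, key, release_on_handover)
    return late.open_slots


print(simulate(release_on_handover=False))  # entries are never released
print(simulate(release_on_handover=True))   # entries released on handover
```

Without the release step, the late scheduler keeps counting tasks it no longer owns, so its slots stay occupied even though it is running nothing.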

How to reproduce

Reduce scheduler_health_check_threshold to 5 and orphaned_tasks_check_interval to 10 in the Airflow config file.
Launch Airflow with two schedulers and schedule multiple DAGs with backfill, running every 1/5 mins.
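
As an alternative to editing the config file, the same two settings can be supplied through environment variables. The sketch below assumes both options live in the `[scheduler]` section and relies on Airflow's `AIRFLOW__{SECTION}__{KEY}` naming convention; `repro_env` is a hypothetical helper name.

```python
import os


def repro_env(base=None):
    """Return a copy of the environment with the two reproduction settings applied."""
    env = dict(os.environ if base is None else base)
    # Assumption: both options belong to the [scheduler] section.
    env["AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD"] = "5"
    env["AIRFLOW__SCHEDULER__ORPHANED_TASKS_CHECK_INTERVAL"] = "10"
    return env
```

The returned mapping could be passed as the `env` argument of `subprocess.Popen` when starting each of the two scheduler processes.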

Operating System

CentOS 6

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes=7.9.0

Deployment

Other Docker-based deployment

Deployment details

Terraform

Anything else

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

c-thiel · 2023-12-06T10:19:25Z

We had the same problem. We only noticed it after upgrading to Airflow 2.7.3, but it was already present before.
Maybe this helps for debugging:

On the left of the red line we had Airflow 2.6.X
On the right side is the update to 2.7.3.

Each line is one scheduler.

Before the update, the schedulers' open slots kept decreasing, sometimes even below zero, but only for some executors.
After the update, the decrease is spread much more evenly across all the executors.

Our Workaround
What helped us was limiting parallelism via pools and setting AIRFLOW__CORE__PARALLELISM to an arbitrarily high number. Regular deployment restarts also help.

dirrao · 2023-12-06T10:22:31Z

@c-thiel
Yes, this workaround works and we are using it as well. I have identified the leaks and am going to open the MR soon.

dirrao added the area:core, kind:bug, and needs-triage labels Nov 16, 2023

dirrao changed the title ~~Leak in Kubernetes Executor Running Tasks SLOT Count~~ Leak in Kubernetes Executor Running Tasks Slot Count Nov 16, 2023

This was referenced Dec 15, 2023

Kubernetes executor running slots leak fix #36240

Merged

Airflow progressive slowness #32928

Closed

potiuk closed this as completed in #36240 Dec 20, 2023

potiuk mentioned this issue Dec 20, 2023

Fix race condition in KubernetesExecutor with concurrently running schedulers #35800

Closed

ephraimbuddy mentioned this issue Jan 16, 2024

Status of testing of Apache Airflow 2.8.1rc1 #36808

Closed

68 tasks
