Airflow progressive slowness #32928

Closed
2 tasks done
llamageddon83 opened this issue Jul 28, 2023 · 46 comments · Fixed by #36240
Labels
affected_version:2.5, area:core, area:Scheduler, kind:bug, pending-response, pinned, stale
Comments

@llamageddon83

llamageddon83 commented Jul 28, 2023

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

We are running Airflow on EKS with version 2.5.3. Airflow has been experiencing progressive slowness over a period of 2-3 weeks, where DAGs start getting queued without ever executing, which forces us to restart the scheduler pod. After the pod restart, the problem goes away for a few days and then slowly creeps back up.

The pods, the logs and the dashboards all look healthy, the UI shows that no tasks are currently running, and there are no worker pods alive. The resource usage graphs (CPU, memory) also look the way they should if no DAGs are actually executing.

During one such outage, we disabled all the DAGs and marked all the tasks as success just to see if the scheduler was able to spin up new worker pods. It never recovered, and we ended up restarting the scheduler pod.

However, there is one dashboard that shows the metrics Executor running tasks and Executor open slots, and we noticed that it accurately reflected the slowness. Over time, the number of open slots would decrease while the number of running tasks would increase. The two would never reset, even during the long window when nothing runs at all (every day between 10:00 PM and 8:00 AM).

These metrics are coming from base_executor:

        Stats.gauge("executor.open_slots", open_slots)
        Stats.gauge("executor.queued_tasks", num_queued_tasks)
        Stats.gauge("executor.running_tasks", num_running_tasks)

and num_running_tasks is defined as num_running_tasks = len(self.running) in base_executor.
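
For context, here is a minimal sketch of how these numbers relate (a hypothetical stand-in class for illustration, not the actual base_executor code): open slots shrink as the in-memory running set grows, so stale entries that are never removed permanently consume slots.

    # Illustration only: a hypothetical stand-in for the executor's slot
    # accounting, mirroring the three gauges quoted above.
    class SlotAccounting:
        def __init__(self, parallelism: int = 10):
            self.parallelism = parallelism
            self.running: set = set()      # keys of tasks believed to be running
            self.queued_tasks: dict = {}   # keys of tasks waiting for a slot

        def gauges(self) -> dict:
            # If completed tasks are never removed from self.running, open slots
            # decay toward zero and nothing new gets scheduled.
            return {
                "executor.open_slots": self.parallelism - len(self.running),
                "executor.queued_tasks": len(self.queued_tasks),
                "executor.running_tasks": len(self.running),
            }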

[Screenshot 2023-07-28 at 3:11:30 PM]

So we enabled some debug logging in this KubernetesExecutor method to see what was in self.running:

    def sync(self) -> None:
        """Synchronize task state."""
        # ...
        if self.running:
            self.log.debug("self.running: %s", self.running)  # --> this log
        # ...
        self.kube_scheduler.sync()

where self.running is defined as self.running: set[TaskInstanceKey] = set(). The logs showed that tasks which had already completed successfully in the past were somehow still present in self.running. For example, a snippet of the log output on the 28th is still holding on to tasks that completed successfully on the 24th and 27th:

**time: Jul 28, 2023 @ 15:07:01.784**
self.running: {TaskInstanceKey(dag_id='flight_history.py', task_id='load_file', run_id=**'manual__2023-07-24T01:06:18+00:00'**, try_number=1, map_index=17), TaskInstanceKey(dag_id='emd_load.py', task_id='processing.emd', run_id='**scheduled__2023-07-25T07:30:00+00:00'**, try_number=1, map_index=-1), 

We validated that these tasks have been completed without any issue from the UI and Postgres DB (which we use as the metadata backend).

Once the scheduler pod is restarted, the problem goes away, the metrics in Grafana dashboard reset and tasks start executing.

What you think should happen instead

Airflow's scheduler keeps track of currently running tasks and their state in memory, and in some cases that state is not getting cleared. Tasks that have completed should eventually be removed from the running set in KubernetesExecutor once the worker pod exits.
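
A rough sketch of that lifecycle, using hypothetical names rather than the executor's actual API: a key enters the running set when its pod is launched and must be discarded again when a terminal state is reported.

    # Hypothetical sketch of the intended lifecycle of a key in the running set.
    from typing import NamedTuple

    class TaskInstanceKey(NamedTuple):
        dag_id: str
        task_id: str
        run_id: str
        try_number: int
        map_index: int

    class MiniExecutor:
        def __init__(self) -> None:
            self.running: set[TaskInstanceKey] = set()

        def launch(self, key: TaskInstanceKey) -> None:
            self.running.add(key)        # slot consumed

        def on_pod_finished(self, key: TaskInstanceKey) -> None:
            # This is the step that apparently gets skipped in some edge case:
            # if it never runs, the key (and its slot) leaks until a restart.
            self.running.discard(key)    # slot released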

How to reproduce

Beats me. Our initial assumption was that this was a DAG implementation issue and some particular DAG was misbehaving. But the problem has occurred with all sorts of DAGs, happens for both scheduled and manual runs, and is sporadic. There is some edge scenario that causes this, but we are unable to nail it down any further.

Operating System

Debian GNU/ Linux 11 (bullseye)

Versions of Apache Airflow Providers

aiofiles==23.1.0
aiohttp==3.8.4
airflow-dbt>=0.4.0
airflow-exporter==1.5.3
anytree==2.8.0
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-http>=2.0.3
apache-airflow-providers-microsoft-mssql==2.1.3
apache-airflow-providers-snowflake>=4.0.4
apache-airflow-providers-hashicorp==3.3.0
apache-airflow-providers-cncf-kubernetes==5.2.2
apache-airflow>=2.2.3
asgiref==3.5.0
Authlib==0.15.5
dbt-snowflake==1.5.2
flatdict==4.0.1
hvac==0.11.2
jsonschema>=4.17.3
pandas==1.3.5
psycopg2-binary==2.9.3
pyOpenSSL==23.1.1
pysftp==0.2.9
pysmbclient==0.1.5
python-gnupg==0.5.0
PyYAML~=5.4.1
requests~=2.26.0
smbprotocol==1.9.0
snowflake-connector-python==3.0.4
snowflake-sqlalchemy==1.4.7
statsd==3.3.0
py7zr==0.20.5

Deployment

Official Apache Airflow Helm Chart

Deployment details

Airflow is deployed via helm charts on EKS in AWS. There are two scheduler pods with AIRFLOW__CORE__PARALLELISM set to 10.

Anything else

N/A

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@llamageddon83 llamageddon83 added the area:core, kind:bug and needs-triage labels Jul 28, 2023
@potiuk potiuk added this to the Airflow 2.7.0 milestone Jul 28, 2023
@potiuk potiuk added the area:Scheduler label and removed the needs-triage label Jul 28, 2023
@ephraimbuddy
Contributor

We have had a minor release since 2.5.3, could you try this with Airflow 2.6.3?

@llamageddon83
Author

Hi @ephraimbuddy thank you for taking a look at this. Yes I will bump the version and see if the issue persists. It will take about 2-3 weeks to push the change through to production.

@llamageddon83
Author

Hi @ephraimbuddy. We pushed 2.6.3 to prod last week. And the progressive slowness has started to creep up again. Version bump didn't seem to remedy the issue.

@potiuk
Member

potiuk commented Aug 27, 2023

Is it possible that you check which tasks are in the "running" state for a while when it happens, and maybe try to find some correlated logs from the scheduler and from task execution - and ideally also find the logs from another instance of the same task that was not stuck in "running", so that we could compare them and see the difference?

I presume that there is some kind of race happening that somehow skips removal of the task even though it has completed - but seeing the logs and comparing them could narrow down the search and might allow us to come up with a plausible hypothesis.

Question - do you run deferrable tasks / the triggerer? Maybe somehow there is a problem with the deferrable code? (Wild guessing now, after looking into some possible paths.)

@llamageddon83
Author

llamageddon83 commented Aug 29, 2023

Hi @potiuk
Please find my answers in line below:

Is it possible that you check which tasks are in the "running" state for a while when it happens, and maybe try to find some correlated logs from the scheduler and from task execution - and ideally also find the logs from another instance of the same task that was not stuck in "running", so that we could compare them and see the difference?

Yes, we have spent quite a bit of time comparing logs from different tasks and logs of the same task from different executions, but there is no oddity we could spot. As far as the running state goes, since this happens very sporadically, we don't know what happens to the task state; whenever we take a look, the tasks have already completed and are no longer in the running state.

I presume that there is some kind of race happening that somehow skips removal of the task even though it has completed - but seeing the logs and comparing them could narrow down the search and might allow us to come up with a plausible hypothesis.

Sorry, I wish I could help more. We have spent weeks trying to narrow it down to some DAG implementation or some kind of pattern we may have, but we got nothing.

Question - do you run deferrable tasks / the triggerer? Maybe somehow there is a problem with the deferrable code? (Wild guessing now, after looking into some possible paths.)

Yes, we have quite a few of them. The deferrable tasks run in a never-ending loop. I know you are very busy and go through many discussions, but if you remember, I previously mentioned how we use triggers for streaming here. The triggers run in a never-ending loop and poll an HTTP endpoint for messages. The schedule for the triggers is stacked and they ramp up to their max_active_runs. Another thing I should point out is that the tasks that do not get removed don't necessarily have any deferrable operators.

Thank you

@patrick-shakudo

patrick-shakudo commented Sep 7, 2023

I am having the same issue with Airflow 2.6.3. The database and Airflow UI show no tasks are running, but the Kubernetes executor thinks there are 64 (parallelism=64) running tasks and skips queuing/running any more tasks.

My initial thought is that within kubernetes_executor.py, process_watcher_task sometimes fails to add a finished pod to the result_queue. Because of this, _change_state is never called on the task instance, so it is never removed from the self.running set of running task instances, and in turn the "critical section" of queueing tasks is never attempted.

I have posted more details about it here in the airflow slack: https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1694033145572849
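
A rough sketch of the pipeline described above, with hypothetical stand-ins rather than the provider code, shows why a single lost completion event is enough to leak a slot:

    # Hypothetical sketch: watcher -> result_queue -> sync() -> running set.
    from queue import Empty, Queue

    result_queue: Queue = Queue()
    running: set = set()

    def watcher_saw_pod_finish(key, state) -> None:
        # If this enqueue is ever skipped (the suspected failure), the key
        # below is never discarded and its executor slot is never freed.
        result_queue.put((key, state))

    def sync() -> None:
        # Drain completion events and release their slots.
        while True:
            try:
                key, _state = result_queue.get_nowait()
            except Empty:
                break
            running.discard(key)   # stand-in for _change_state()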

@Cricktom

I am also facing the same issue with Airflow 2.6.2. The Airflow UI shows no tasks are running, but the Celery executor's open slots are stuck at zero while its running slots are stuck at 32, which causes it to skip queuing/running any more tasks.

AIRFLOW__CORE__PARALLELISM is 32 and the number of workers running is 2, each with AIRFLOW__CELERY__WORKER_CONCURRENCY set to 8.

I'm not really sure how to reproduce it, because it happens over a period of time, and that too sporadically.

@ephraimbuddy
Contributor

I am also facing the same issue with Airflow 2.6.2. The Airflow UI shows no tasks are running, but the Celery executor's open slots are stuck at zero while its running slots are stuck at 32, which causes it to skip queuing/running any more tasks.

AIRFLOW__CORE__PARALLELISM is 32 and the number of workers running is 2, each with AIRFLOW__CELERY__WORKER_CONCURRENCY set to 8.

I'm not really sure how to reproduce it, because it happens over a period of time, and that too sporadically.

Looks like you have a different issue, can you create a separate issue with more information?

@Cricktom

Thank you @ephraimbuddy for taking a look at this. Here are more details that support the same behavior as mentioned in this bug.

The below image shows the number of tasks running for the Celery executor. It slowly goes up, and when it reaches the limit it sometimes doesn't come down until we restart the scheduler.
[image]

The below image shows the number of open slots for the Celery executor, which mirrors the pattern above because it's inversely proportional to the number of running tasks.
[image]

The below image shows the number of tasks running per pool in the Airflow webserver UI. Here we can see that the number of running tasks is not in proportion with the Celery executor slots, and we have observed that after a point Airflow stops scheduling any tasks at all because it cannot find any open executor slots.
[image]

To mitigate this issue, we are restarting the scheduler every hour. Please suggest a better approach if there is one.

@adrianrego

I'm also seeing this issue on our setup...Airflow 2.7.1, KubernetesExecutor. Running 3 scheduler pods. Looks like things start going downhill after 5 days. Restarting the schedulers gets things moving again.

@github-actions

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale label Oct 20, 2023
@sa2413

sa2413 commented Oct 27, 2023

any update on this issue? facing this on multiple k8s clusters

@potiuk potiuk added the pinned label and removed the stale label Oct 27, 2023
@potiuk
Member

potiuk commented Oct 27, 2023

No, but if someone would like to spend time trying to analyse it, that would be great. Until then - if you experience it, I recommend restarting Airflow periodically. It seems like a super-simple solution that has no real bad side effects and can be implemented in 5 minutes, rather than volunteers spending days trying to understand the root cause (plus it could be coming from a deployment issue).

Pragmatically, it seems like an easy way to solve your problem, at least for now, if you are eager to get your multiple clusters working again.

Another option might be to fund an effort by someone to analyse it. @sa2413 - maybe your company would like to sponsor some maintainers, for example, to investigate it deeply? We've never done anything like that - but if you are eager to get a solution, maybe that's the right way of going about it, I wonder?
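
For the periodic-restart workaround mentioned above, here is a minimal sketch that does the equivalent of kubectl rollout restart on the scheduler Deployment. The namespace and Deployment name are assumptions for illustration, and it would be run from a cron job or similar:

    # Sketch: trigger a rolling restart of the scheduler Deployment (the same
    # mechanism kubectl rollout restart uses) by bumping a pod-template annotation.
    from datetime import datetime, timezone
    from kubernetes import client, config

    def rollout_restart_scheduler(namespace: str = "airflow",
                                  deployment: str = "airflow-scheduler") -> None:
        config.load_incluster_config()  # use config.load_kube_config() outside the cluster
        patch = {
            "spec": {
                "template": {
                    "metadata": {
                        "annotations": {
                            "kubectl.kubernetes.io/restartedAt":
                                datetime.now(timezone.utc).isoformat()
                        }
                    }
                }
            }
        }
        client.AppsV1Api().patch_namespaced_deployment(deployment, namespace, patch)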

@dirrao
Contributor

dirrao commented Dec 20, 2023

I have created a new issue #36335 for celery executor. Let's keep this for Kubernetes executor.

@dirrao
Contributor

dirrao commented Dec 22, 2023

@llamageddon83
We have faced a similar issue in production, identified the root cause, and provided the fix in #36240. If you are looking for an immediate fix, you can try that patch (#36240). As a workaround, you can increase the parallelism to a large number (10K).

@dirrao
Contributor

dirrao commented Dec 28, 2023

Hi @llamageddon83,
The Kubernetes executor slot-leak fix is available in version 7.12.0 of the cncf-kubernetes provider. You can try it and see if it fixes your issue.
https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/changelog.html

@ephraimbuddy
Contributor

FYI, I updated [scheduler] max_tis_per_query to 512 in 2.8.0 and saw a good improvement in the number of tasks run per minute.


This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale label Jan 31, 2024

github-actions bot commented Feb 8, 2024

This issue has been closed because it has not received response from the issue author.

@github-actions github-actions bot closed this as not planned Feb 8, 2024
@shahar1 shahar1 removed the stale label Sep 25, 2024
@shahar1 shahar1 reopened this Sep 25, 2024

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale label Oct 20, 2024
@shahar1 shahar1 removed the stale and pending-response labels Oct 23, 2024
romsharon98 pushed a commit to romsharon98/airflow that referenced this issue Oct 28, 2024
…hedulers

A race condition occurs in the _adopt_completed_pods function when schedulers are running concurrently. The _adopt_completed_pods function doesn't keep track of which scheduler went down, so it constantly tries to adopt completed pods from normally working schedulers. On Airflow setups with concurrently running schedulers and a lot of short-lived DAGs, this leads to a race condition and an open-slots leak. You can find a detailed analysis of this situation in the GitHub issue here (apache#32928 (comment)). The _adopt_completed_pods function was refactored into the _delete_orphaned_completed_pods function, which removes only completed pods that are not bound to running schedulers.

Co-authored-by: Vlad Pastushenko <[email protected]>
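
As a conceptual illustration of the refactor described in that commit message (hypothetical data model, not the provider code): completed pods are only cleaned up when their owning scheduler is no longer alive, so a healthy scheduler still gets to process its own completion events and free its slots.

    # Conceptual sketch of "only clean up pods whose scheduler is gone".
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CompletedPod:
        name: str
        scheduler_job_id: str   # the scheduler that launched this pod

    def orphaned_completed_pods(completed_pods: list[CompletedPod],
                                alive_scheduler_ids: set[str]) -> list[CompletedPod]:
        # Pods still owned by a live scheduler are left for that scheduler to
        # handle; grabbing them was the source of the race and the slot leak.
        return [p for p in completed_pods
                if p.scheduler_job_id not in alive_scheduler_ids]
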
@potiuk
Member

potiuk commented Nov 5, 2024

Airflow 2.10.3 is now out and it has fix #42932, which is likely to fix the problems you reported. Please upgrade, check if it fixed your problem, and report back @llamageddon83?


This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale label Nov 30, 2024
@omametrika

omametrika commented Dec 10, 2024

Airflow 2.10.3 is now out and it has fix #42932, which is likely to fix the problems you reported. Please upgrade, check if it fixed your problem, and report back @llamageddon83?

I'm on 2.10.3 and have just experienced this very issue.

Edit: ...which went away after another scheduler restart. Very weird. Will monitor and report back if it happens again.

@potiuk
Member

potiuk commented Dec 10, 2024

Airflow 2.10.3 is now out and it has fix #42932, which is likely to fix the problems you reported. Please upgrade, check if it fixed your problem, and report back @llamageddon83?

I'm on 2.10.3 and have just experienced this very issue.

Edit: ...which went away after another scheduler restart. Very weird. Will monitor and report back if it happens again.

👀 It would be great if you could provide some more details of your setup / observed behaviour when it happens again - similar to the observations above. It could be that the issue is similar, but not the same. Ideally, open a new discussion and link it to this one as "related" - it's always easier to mark issues/discussions as duplicates than to split unrelated issues out of a single thread/discussion.

@github-actions github-actions bot removed the stale label Dec 11, 2024

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale label Dec 27, 2024

github-actions bot commented Jan 7, 2025

This issue has been closed because it has not received response from the issue author.


@github-actions github-actions bot closed this as not planned Jan 7, 2025