Airflow progressive slowness #32928

Closed
2 tasks done
llamageddon83 opened this issue Jul 28, 2023 · 46 comments · Fixed by #36240
Labels
affected_version:2.5, area:core, area:Scheduler, kind:bug, pending-response, pinned, stale
Comments

@llamageddon83

llamageddon83 commented Jul 28, 2023

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

We are running Airflow on EKS with version 2.5.3. Airflow has been experiencing progressive slowness over a period of 2-3 weeks, where DAGs start getting queued without ever executing, which forces us to restart the scheduler pod. After the pod restart, the problem goes away for a few days and then slowly creeps back up.

The pods, the logs and the dashboards all look healthy, the UI shows that no tasks are currently running, and there are no worker pods alive. The resource usage graphs (CPU, memory) also look the way they should if no DAGs are actually executing.

During one such outage, we disabled all the DAGs and marked all the tasks as success just to see if the scheduler was able to spin up new worker pods. It never recovered, and we ended up restarting the scheduler pod.

However, there is one dashboard that shows the metrics Executor running tasks and Executor open slots, and we noticed that it accurately reflected the slowness. Over time, the number of open slots would decrease while the number of running tasks would increase. The two would never reset, even during the long window when nothing runs at all (every day between 10:00 PM and 8:00 AM).

These metrics are coming from base_executor:

        Stats.gauge("executor.open_slots", open_slots)
        Stats.gauge("executor.queued_tasks", num_queued_tasks)
        Stats.gauge("executor.running_tasks", num_running_tasks)

and num_running_tasks is defined as num_running_tasks = len(self.running) in base_executor.
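
For context, here is a minimal sketch of how these numbers relate (a hypothetical stand-in class for illustration, not the actual base_executor code): open slots shrink as the in-memory running set grows, so stale entries that are never removed permanently consume slots.

    # Illustration only: a hypothetical stand-in for the executor's slot
    # accounting, mirroring the three gauges quoted above.
    class SlotAccounting:
        def __init__(self, parallelism: int = 10):
            self.parallelism = parallelism
            self.running: set = set()      # keys of tasks believed to be running
            self.queued_tasks: dict = {}   # keys of tasks waiting for a slot

        def gauges(self) -> dict:
            # If completed tasks are never removed from self.running, open slots
            # decay toward zero and nothing new gets scheduled.
            return {
                "executor.open_slots": self.parallelism - len(self.running),
                "executor.queued_tasks": len(self.queued_tasks),
                "executor.running_tasks": len(self.running),
            }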

[Screenshot 2023-07-28 at 3:11:30 PM]

So we enabled some debug logging in this KubernetesExecutor method to see what was in self.running:

    def sync(self) -> None:
        """Synchronize task state."""
        # ...
        if self.running:
            self.log.debug("self.running: %s", self.running)  # --> this log
        # ...
        self.kube_scheduler.sync()

where self.running is defined as self.running: set[TaskInstanceKey] = set(). The logs showed that tasks which had already completed successfully in the past were somehow still present in self.running. For example, a snippet of the log output on the 28th is still holding on to tasks that completed successfully on the 24th and 27th:

**time: Jul 28, 2023 @ 15:07:01.784**
self.running: {TaskInstanceKey(dag_id='flight_history.py', task_id='load_file', run_id=**'manual__2023-07-24T01:06:18+00:00'**, try_number=1, map_index=17), TaskInstanceKey(dag_id='emd_load.py', task_id='processing.emd', run_id='**scheduled__2023-07-25T07:30:00+00:00'**, try_number=1, map_index=-1), 

We validated that these tasks have been completed without any issue from the UI and Postgres DB (which we use as the metadata backend).

Once the scheduler pod is restarted, the problem goes away, the metrics in Grafana dashboard reset and tasks start executing.

What you think should happen instead

Airflow's scheduler keeps track of currently running tasks and their state in memory, and in some cases that state is not getting cleared. Tasks that have completed should eventually be removed from the running set in KubernetesExecutor once the worker pod exits.
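
A rough sketch of that lifecycle, using hypothetical names rather than the executor's actual API: a key enters the running set when its pod is launched and must be discarded again when a terminal state is reported.

    # Hypothetical sketch of the intended lifecycle of a key in the running set.
    from typing import NamedTuple

    class TaskInstanceKey(NamedTuple):
        dag_id: str
        task_id: str
        run_id: str
        try_number: int
        map_index: int

    class MiniExecutor:
        def __init__(self) -> None:
            self.running: set[TaskInstanceKey] = set()

        def launch(self, key: TaskInstanceKey) -> None:
            self.running.add(key)        # slot consumed

        def on_pod_finished(self, key: TaskInstanceKey) -> None:
            # This is the step that apparently gets skipped in some edge case:
            # if it never runs, the key (and its slot) leaks until a restart.
            self.running.discard(key)    # slot released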

How to reproduce

Beats me. Our initial assumption was that this was a DAG implementation issue and some particular DAG was misbehaving. But the problem has occurred with all sorts of DAGs, happens for both scheduled and manual runs, and is sporadic. There is some edge scenario that causes this, but we are unable to nail it down any further.

Operating System

Debian GNU/ Linux 11 (bullseye)

Versions of Apache Airflow Providers

aiofiles==23.1.0
aiohttp==3.8.4
airflow-dbt>=0.4.0
airflow-exporter==1.5.3
anytree==2.8.0
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-http>=2.0.3
apache-airflow-providers-microsoft-mssql==2.1.3
apache-airflow-providers-snowflake>=4.0.4
apache-airflow-providers-hashicorp==3.3.0
apache-airflow-providers-cncf-kubernetes==5.2.2
apache-airflow>=2.2.3
asgiref==3.5.0
Authlib==0.15.5
dbt-snowflake==1.5.2
flatdict==4.0.1
hvac==0.11.2
jsonschema>=4.17.3
pandas==1.3.5
psycopg2-binary==2.9.3
pyOpenSSL==23.1.1
pysftp==0.2.9
pysmbclient==0.1.5
python-gnupg==0.5.0
PyYAML~=5.4.1
requests~=2.26.0
smbprotocol==1.9.0
snowflake-connector-python==3.0.4
snowflake-sqlalchemy==1.4.7
statsd==3.3.0
py7zr==0.20.5

Deployment

Official Apache Airflow Helm Chart

Deployment details

Airflow is deployed via helm charts on EKS in AWS. There are two scheduler pods with AIRFLOW__CORE__PARALLELISM set to 10.

Anything else

N/A

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@llamageddon83 llamageddon83 added the area:core, kind:bug and needs-triage labels Jul 28, 2023
@potiuk potiuk added this to the Airflow 2.7.0 milestone Jul 28, 2023
@potiuk potiuk added the area:Scheduler label and removed the needs-triage label Jul 28, 2023
@ephraimbuddy
Contributor

We have had a minor release since 2.5.3, could you try this with Airflow 2.6.3?

@llamageddon83
Author

Hi @ephraimbuddy thank you for taking a look at this. Yes I will bump the version and see if the issue persists. It will take about 2-3 weeks to push the change through to production.

@llamageddon83
Author

Hi @ephraimbuddy. We pushed 2.6.3 to prod last week. And the progressive slowness has started to creep up again. Version bump didn't seem to remedy the issue.

@potiuk
Member

potiuk commented Aug 27, 2023

Is it possible that you check which tasks are in the "running" state for a while when it happens, and maybe try to find some correlated logs from the scheduler and from task execution - and ideally also find the logs from another instance of the same task that was not stuck in "running", so that we could compare them and see the difference?

I presume that there is some kind of race happening that somehow skips removal of the task even though it has completed - but seeing the logs and comparing them could narrow down the search and might allow us to come up with a plausible hypothesis.

Question - do you run deferrable tasks / the triggerer? Maybe somehow there is a problem with the deferrable code? (Wild guessing now, after looking into some possible paths.)

@llamageddon83
Author

llamageddon83 commented Aug 29, 2023

Hi @potiuk
Please find my answers in line below:

Is it possible that you check which tasks are in the "running" state for a while when it happens, and maybe try to find some correlated logs from the scheduler and from task execution - and ideally also find the logs from another instance of the same task that was not stuck in "running", so that we could compare them and see the difference?

Yes, we have spent quite a bit of time comparing logs from different tasks and logs of the same task from different executions, but there is no oddity we could spot. As far as the running state goes, since this happens very sporadically, we don't know what happens to the task state; whenever we take a look, the tasks have already completed and are no longer in the running state.

I presume that there is some kind of race happening that somehow skips removal of the task even though it has completed - but seeing the logs and comparing them could narrow down the search and might allow us to come up with a plausible hypothesis.

Sorry, I wish I could help more. We have spent weeks trying to narrow it down to some DAG implementation or some kind of pattern we may have, but we got nothing.

Question - do you run deferrable tasks / the triggerer? Maybe somehow there is a problem with the deferrable code? (Wild guessing now, after looking into some possible paths.)

Yes, we have quite a few of them. The deferrable tasks run in a never-ending loop. I know you are very busy and go through many discussions, but if you remember, I previously mentioned how we use triggers for streaming here. The triggers run in a never-ending loop and poll an HTTP endpoint for messages. The schedule for the triggers is stacked and they ramp up to their max_active_runs. Another thing I should point out is that the tasks that do not get removed don't necessarily have any deferrable operators.

Thank you

@patrick-shakudo

patrick-shakudo commented Sep 7, 2023

I am having the same issue with Airflow 2.6.3. The database and Airflow UI show no tasks are running, but the Kubernetes executor thinks there are 64 (parallelism=64) running tasks and skips queuing/running any more tasks.

My initial thought is that within kubernetes_executor.py, process_watcher_task sometimes fails to add a finished pod to the result_queue. Because of this, _change_state is never called on the task instance, so it is never removed from the self.running set of running task instances, and in turn the "critical section" of queueing tasks is never attempted.

I have posted more details about it here in the airflow slack: https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1694033145572849
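
A rough sketch of the pipeline described above, with hypothetical stand-ins rather than the provider code, shows why a single lost completion event is enough to leak a slot:

    # Hypothetical sketch: watcher -> result_queue -> sync() -> running set.
    from queue import Empty, Queue

    result_queue: Queue = Queue()
    running: set = set()

    def watcher_saw_pod_finish(key, state) -> None:
        # If this enqueue is ever skipped (the suspected failure), the key
        # below is never discarded and its executor slot is never freed.
        result_queue.put((key, state))

    def sync() -> None:
        # Drain completion events and release their slots.
        while True:
            try:
                key, _state = result_queue.get_nowait()
            except Empty:
                break
            running.discard(key)   # stand-in for _change_state()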

@Cricktom

I am also facing the same issue with Airflow 2.6.2. The Airflow UI shows no tasks are running, but the Celery executor's open slots are stuck at zero while its running slots are stuck at 32, which causes it to skip queuing/running any more tasks.

AIRFLOW__CORE__PARALLELISM is 32 and the number of workers running is 2, each with AIRFLOW__CELERY__WORKER_CONCURRENCY set to 8.

I'm not really sure how to reproduce it, because it happens over a period of time, and that too sporadically.

@ephraimbuddy
Contributor

I am also facing the same issue with Airflow 2.6.2. The Airflow UI shows no tasks are running, but the Celery executor's open slots are stuck at zero while its running slots are stuck at 32, which causes it to skip queuing/running any more tasks.

AIRFLOW__CORE__PARALLELISM is 32 and the number of workers running is 2, each with AIRFLOW__CELERY__WORKER_CONCURRENCY set to 8.

I'm not really sure how to reproduce it, because it happens over a period of time, and that too sporadically.

Looks like you have a different issue, can you create a separate issue with more information?

@Cricktom

Thank you @ephraimbuddy for taking a look at this. Here are more details that support the same behavior as mentioned in this bug.

The below image shows the number of tasks running for the Celery executor. It slowly goes up, and when it reaches the limit it sometimes doesn't come down until we restart the scheduler.
[image]

The below image shows the number of open slots for the Celery executor, which mirrors the pattern above because it's inversely proportional to the number of running tasks.
[image]

The below image shows the number of tasks running per pool in the Airflow webserver UI. Here we can see that the number of running tasks is not in proportion with the Celery executor slots, and we have observed that after a point Airflow stops scheduling any tasks at all because it cannot find any open executor slots.
[image]

To mitigate this issue, we are restarting the scheduler every hour. Please suggest a better approach if there is one.

@adrianrego

I'm also seeing this issue on our setup...Airflow 2.7.1, KubernetesExecutor. Running 3 scheduler pods. Looks like things start going downhill after 5 days. Restarting the schedulers gets things moving again.

@github-actions

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale label Oct 20, 2023
@sa2413

sa2413 commented Oct 27, 2023

any update on this issue? facing this on multiple k8s clusters

@potiuk potiuk added the pinned label and removed the stale label Oct 27, 2023
@potiuk
Member

potiuk commented Oct 27, 2023

No, but if someone would like to spend time trying to analyse it, that would be great. Until then - if you experience it, I recommend restarting Airflow periodically. It seems like a super-simple solution that has no real bad side effects and can be implemented in 5 minutes, rather than volunteers spending days trying to understand the root cause (plus it could be coming from a deployment issue).

Pragmatically, it seems like an easy way to solve your problem, at least for now, if you are eager to get your multiple clusters working again.

Another option might be to fund an effort by someone to analyse it. @sa2413 - maybe your company would like to sponsor some maintainers, for example, to investigate it deeply? We've never done anything like that - but if you are eager to get a solution, maybe that's the right way of going about it, I wonder?
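
For the periodic-restart workaround mentioned above, here is a minimal sketch that does the equivalent of kubectl rollout restart on the scheduler Deployment. The namespace and Deployment name are assumptions for illustration, and it would be run from a cron job or similar:

    # Sketch: trigger a rolling restart of the scheduler Deployment (the same
    # mechanism kubectl rollout restart uses) by bumping a pod-template annotation.
    from datetime import datetime, timezone
    from kubernetes import client, config

    def rollout_restart_scheduler(namespace: str = "airflow",
                                  deployment: str = "airflow-scheduler") -> None:
        config.load_incluster_config()  # use config.load_kube_config() outside the cluster
        patch = {
            "spec": {
                "template": {
                    "metadata": {
                        "annotations": {
                            "kubectl.kubernetes.io/restartedAt":
                                datetime.now(timezone.utc).isoformat()
                        }
                    }
                }
            }
        }
        client.AppsV1Api().patch_namespaced_deployment(deployment, namespace, patch)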

@dirrao
Contributor

dirrao commented Dec 20, 2023

I have created a new issue #36335 for celery executor. Let's keep this for Kubernetes executor.

@dirrao
Contributor

dirrao commented Dec 22, 2023

@llamageddon83
We have faced a similar issue in production, identified the root cause, and provided the fix in #36240. If you are looking for an immediate fix, you can try that patch (#36240). As a workaround, you can increase the parallelism to a large number (10K).

@dirrao
Contributor

dirrao commented Dec 28, 2023

Hi @llamageddon83,
The Kubernetes executor slot-leak fix is available in version 7.12.0 of the cncf-kubernetes provider. You can try it and see if it fixes your issue.
https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/changelog.html

@ephraimbuddy
Contributor

FYI, I updated [scheduler] max_tis_per_query to 512 in 2.8.0 and saw a good improvement in the number of tasks run per minute.


This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale label Jan 31, 2024

github-actions bot commented Feb 8, 2024

This issue has been closed because it has not received response from the issue author.

@github-actions github-actions bot closed this as not planned Feb 8, 2024
@shahar1 shahar1 removed the stale label Sep 25, 2024
@shahar1 shahar1 reopened this Sep 25, 2024

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale label Oct 20, 2024
@shahar1 shahar1 removed the stale and pending-response labels Oct 23, 2024
romsharon98 pushed a commit to romsharon98/airflow that referenced this issue Oct 28, 2024
…hedulers

A race condition occurs in the _adopt_completed_pods function when schedulers are running concurrently. The _adopt_completed_pods function doesn't keep track of which scheduler went down, so it constantly tries to adopt completed pods from normally working schedulers. On Airflow setups with concurrently running schedulers and a lot of short-lived DAGs, this leads to a race condition and an open-slots leak. You can find a detailed analysis of this situation in the GitHub issue here (apache#32928 (comment)). The _adopt_completed_pods function was refactored into the _delete_orphaned_completed_pods function, which removes only completed pods that are not bound to running schedulers.

Co-authored-by: Vlad Pastushenko <[email protected]>
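
As a conceptual illustration of the refactor described in that commit message (hypothetical data model, not the provider code): completed pods are only cleaned up when their owning scheduler is no longer alive, so a healthy scheduler still gets to process its own completion events and free its slots.

    # Conceptual sketch of "only clean up pods whose scheduler is gone".
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CompletedPod:
        name: str
        scheduler_job_id: str   # the scheduler that launched this pod

    def orphaned_completed_pods(completed_pods: list[CompletedPod],
                                alive_scheduler_ids: set[str]) -> list[CompletedPod]:
        # Pods still owned by a live scheduler are left for that scheduler to
        # handle; grabbing them was the source of the race and the slot leak.
        return [p for p in completed_pods
                if p.scheduler_job_id not in alive_scheduler_ids]
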
@potiuk
Member

potiuk commented Nov 5, 2024

Airflow 2.10.3 is now out and it has fix #42932, which is likely to fix the problems you reported. Please upgrade, check if it fixed your problem, and report back @llamageddon83?


This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale label Nov 30, 2024
@omametrika

omametrika commented Dec 10, 2024

Airflow 2.10.3 is now out and it has fix #42932, which is likely to fix the problems you reported. Please upgrade, check if it fixed your problem, and report back @llamageddon83?

I'm on 2.10.3 and have just experienced this very issue.

Edit: ...which went away after another scheduler restart. Very weird. Will monitor and report back if it happens again.

@potiuk
Member

potiuk commented Dec 10, 2024

Airflow 2.10.3 is now out and it has fix #42932, which is likely to fix the problems you reported. Please upgrade, check if it fixed your problem, and report back @llamageddon83?

I'm on 2.10.3 and have just experienced this very issue.

Edit: ...which went away after another scheduler restart. Very weird. Will monitor and report back if it happens again.

👀 It would be great if you could provide some more details of your setup / observed behaviour when it happens again - similar to the observations above. It could be that the issue is similar, but not the same. Ideally, open a new discussion and link it to this one as "related" - it's always easier to mark issues/discussions as duplicates than to split unrelated issues out of a single thread/discussion.

@github-actions github-actions bot removed the stale label Dec 11, 2024

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale label Dec 27, 2024

github-actions bot commented Jan 7, 2025

This issue has been closed because it has not received response from the issue author.


@github-actions github-actions bot closed this as not planned Jan 7, 2025