Airflow progressive slowness #32928
Comments
We have had a minor release since 2.5.3, could you try this with Airflow 2.6.3?
Hi @ephraimbuddy, thank you for taking a look at this. Yes, I will bump the version and see if the issue persists. It will take about 2-3 weeks to push the change through to production.
Hi @ephraimbuddy. We pushed 2.6.3 to prod last week, and the progressive slowness has started to creep up again. The version bump didn't seem to remedy the issue.
Is it possible that you check which tasks are in "running" state for some time when it happens, and maybe try to find some correlated logs from the scheduler and task execution - and ideally also find the logs from another task instance of the same task that was not found in "running" - so that we could compare them and see the difference? I presume that there is some kind of race happening that will somehow skip removal of the task even if it has been completed - but seeing the logs and comparing them could narrow down the search and might allow us to come up with a plausible hypothesis. Question - do you run deferrable tasks / triggerer? Maybe somehow there is a problem with deferrable code? (Wild guessing now after looking into some possible paths.)
Hi @potiuk

> Is it possible that you check which tasks are in "running" state for some time when it happens, and maybe try to find some correlated logs from the scheduler and task execution - and ideally also find the logs from another task instance of the same task that was not found in "running" - so that we could compare them and see the difference?
> I presume that there is some kind of race happening that will somehow skip removal of the task even if it has been completed - but seeing the logs and comparing them could narrow down the search and might allow us to come up with a plausible hypothesis.
> Question - do you run deferrable tasks / triggerer? Maybe somehow there is a problem with deferrable code? (Wild guessing now after looking into some possible paths.)

Thank you
I am having the same issue with Airflow 2.6.3. The database and Airflow UI show no tasks are running, but the Kubernetes executor thinks there are 64 (parallelism=64) running tasks and skips over queuing / running any more tasks. My initial thought is that the problem appears to be within kubernetes_executor.py. I have posted more details about it here in the Airflow Slack: https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1694033145572849
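For context on why a full `running` set blocks everything: the base executor only hands queued tasks to workers when it believes there are open slots, and open slots are computed from `parallelism` minus the size of its in-memory `running` set. A minimal sketch of that bookkeeping (paraphrased, not the verbatim Airflow code):

```python
# Minimal sketch of executor slot accounting (paraphrased, not the verbatim
# Airflow implementation) showing how leaked "running" keys starve scheduling.
class SketchExecutor:
    def __init__(self, parallelism: int = 64) -> None:
        self.parallelism = parallelism
        self.running: set = set()     # task keys the executor believes are running
        self.queued_tasks: dict = {}  # task keys waiting for a free slot

    def heartbeat(self) -> int:
        # If completed tasks are never removed from self.running, open_slots
        # eventually hits 0 and nothing queued is ever triggered again.
        open_slots = self.parallelism - len(self.running)
        return max(min(open_slots, len(self.queued_tasks)), 0)

    def change_state(self, key, state) -> None:
        # Normal path: a terminal state change removes the key from running.
        # The behaviour described in this issue suggests this sometimes does
        # not happen for pods that have already completed.
        self.running.discard(key)
```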
I am also facing the same issue with Airflow 2.6.2. The Airflow UI shows no tasks are running, but the Celery executor's open slots are stuck at zero while running slots are stuck at 32, which results in skipping over queuing/running any more tasks. AIRFLOW__CORE__PARALLELISM is 32 and the number of workers running is 2, with AIRFLOW__CELERY__WORKER_CONCURRENCY set to 8. I'm not really sure how to reproduce it, because it happens over a period of time, and sporadically at that.
Looks like you have a different issue; can you create a separate issue with more information?
Thank you @ephraimbuddy for taking a look at this. Here are more details that support the same behavior as mentioned in this bug. The first image shows the number of tasks running for the Celery executor. It slowly goes up, and when it reaches the limit it sometimes doesn't come down until we restart the scheduler. The second image shows the number of open slots for the Celery executor, which follows the inverse pattern because it's inversely proportional to the number of running tasks. The third image shows the number of tasks running per pool in the Airflow webserver UI. Here we can see that the number of running tasks is not in proportion with the Celery executor slots, and we have observed that after a point Airflow stops scheduling any tasks at all because it doesn't find any open executor slots. To mitigate this issue, we are restarting the scheduler every hour. Please suggest a better approach if there is one.
I'm also seeing this issue on our setup: Airflow 2.7.1, KubernetesExecutor. Running 3 scheduler pods. Looks like things start going downhill after 5 days. Restarting the schedulers gets things moving again.
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
Any update on this issue? We are facing this on multiple k8s clusters.
No, but if someone would like to spend time trying to analyse it, that would be great. Until then - if you experience it, I recommend restarting Airflow periodically. It seems like a super-simple solution that has no real bad side effects and can be implemented in 5 minutes, rather than volunteers spending days trying to understand the root cause (plus it could be coming from a deployment issue). Pragmatically, it seems like an easy way to solve your problem, at least for now, if you are eager to get your multiple clusters working again. Another option might be to fund an effort for someone to help analyse it, maybe? @sa2413 - maybe your case/company would like to sponsor some maintainers, for example, to investigate it deeply? We've never done anything like that - but if you are eager to get a solution, maybe that's the right way of solving it, I wonder?
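One way to implement such a periodic restart on a Kubernetes deployment is a small script that triggers a rolling restart of the scheduler Deployment, run from a CronJob or similar. This is only a sketch; the deployment name and namespace below are placeholders for your own setup, not values taken from this thread:

```python
# Hypothetical helper to trigger a rolling restart of the scheduler
# Deployment; the name and namespace are placeholders.
from datetime import datetime, timezone

from kubernetes import client, config


def restart_scheduler(deployment: str = "airflow-scheduler", namespace: str = "airflow") -> None:
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    apps = client.AppsV1Api()
    # Patching this annotation makes Kubernetes roll the pods, the same effect
    # as `kubectl rollout restart deployment/<name>`.
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
                    }
                }
            }
        }
    }
    apps.patch_namespaced_deployment(name=deployment, namespace=namespace, body=patch)


if __name__ == "__main__":
    restart_scheduler()
```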
I have created a new issue #36335 for the Celery executor. Let's keep this one for the Kubernetes executor.
@llamageddon83
Hi @llamageddon83,
FYI, I updated
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received response from the issue author.
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
…hedulers A race condition occurs in the _adopt_completed_pods function when schedulers are running concurrently. The _adopt_completed_pods function doesn't keep track of which scheduler went down, so it constantly tries to adopt completed pods from normally working schedulers. On Airflow setups with concurrently running schedulers and a lot of short-lived DAGs, this leads to a race condition and an open-slots leak. You can find a detailed analysis of this situation in the GitHub issue here (apache#32928 (comment)). The _adopt_completed_pods function was refactored into the _delete_orphaned_completed_pods function, which removes only completed pods that are not bound to running schedulers. Co-authored-by: Vlad Pastushenko <[email protected]>
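As a simplified, hypothetical illustration of the approach described in that commit message (the function and its arguments below are placeholders, not actual Airflow or Kubernetes-provider APIs): only completed pods whose owning scheduler is no longer alive are cleaned up, instead of adopting completed pods from schedulers that are still healthy.

```python
# Hypothetical sketch of "delete orphaned completed pods" logic; the helpers
# passed in are placeholders, not real Airflow or Kubernetes-provider APIs.
def delete_orphaned_completed_pods(completed_pods, alive_scheduler_job_ids, delete_pod):
    """Remove completed pods only if their owning scheduler is gone.

    completed_pods: iterable of (pod_name, scheduler_job_id) pairs.
    alive_scheduler_job_ids: set of job ids of schedulers still running.
    delete_pod: callback that actually deletes the pod.
    """
    for pod_name, scheduler_job_id in completed_pods:
        if scheduler_job_id in alive_scheduler_job_ids:
            # The owning scheduler is healthy; let it do its own bookkeeping
            # instead of racing it for the pod.
            continue
        delete_pod(pod_name)
```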
Airflow 2.10.3 is now out and it has fix #42932, which is likely to fix the problems you reported. Please upgrade, check if it fixed your problem, and report back, @llamageddon83?
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
I'm on 2.10.3 and have just experienced this very issue. Edit: ...which went away after another scheduler restart. Very weird. Will monitor and report back if it happens again.
👀 Would be great if you could provide some more details of your setup / observed behaviour when it happens again - similar to the observations above. It could be that the issue is similar, but not the same. Ideally - open a new discussion and link it to this one as "related" - it's always easier to mark issues/discussions as duplicates than to split unrelated issues from a single thread/discussion.
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received response from the issue author.
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
We are running Airflow on EKS with version 2.5.3. Airflow has been experiencing progressive slowness over a period of 2-3 weeks, where DAGs start getting queued without ever executing, which leads us to restart the scheduler pod. After the pod restart, the problem goes away for a few days and then starts to slowly creep back up.
The pods, the logs and the dashboards all look healthy; the UI shows that no tasks are currently running and that there are no worker pods alive. The resource usage graphs (CPU, memory) also look the way they should if no DAGs are actually executing.
During one such outage, we disabled all the DAGs and marked all the tasks as success just to see if the scheduler was able to spin up new worker pods. The scheduler never recovered, and we restarted the scheduler pod.
However, there is one dashboard that shows metrics named `Executor running tasks` and `Executor open slots`. We noticed that this dashboard was accurately representing the slowness behavior. Over a period of time, the number of open slots would decrease and the number of running tasks would increase correspondingly. These two would never reset, even when nothing is running during a long period of time, which is every day between 10:00 PM and 8:00 AM. These metrics are coming from `base_executor`:
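Roughly, the gauges are emitted from the executor heartbeat along these lines (a paraphrased sketch, not the verbatim `base_executor` source):

```python
# Paraphrased sketch of the gauge emission in the executor heartbeat
# (not the verbatim Airflow base_executor source).
from airflow.stats import Stats


def emit_executor_gauges(parallelism: int, running: set, queued_tasks: dict) -> None:
    # Open slots shrink as the in-memory running set grows
    # (simplified; unlimited parallelism is handled separately).
    open_slots = parallelism - len(running)
    Stats.gauge("executor.open_slots", open_slots)
    Stats.gauge("executor.queued_tasks", len(queued_tasks))
    Stats.gauge("executor.running_tasks", len(running))
```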
and `num_running_tasks` is defined as `num_running_tasks = len(self.running)` in `base_executor`.

So we enabled some logs from `KubernetesExecutor` under this method to see what was in `self.running`, where `self.running` is defined as `self.running: set[TaskInstanceKey] = set()`. The log showed that somehow tasks that had already completed successfully in the past still existed in `self.running`. For example, a snippet of the log output on the 28th was still holding on to tasks that had already completed successfully on the 24th and 27th. We validated from the UI and the Postgres DB (which we use as the metadata backend) that these tasks had completed without any issue.
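For reference, the DB-side check amounts to querying the `task_instance` table for the affected tasks; the connection string and task identifiers below are placeholders, not values from our environment:

```python
# Hedged example: confirm in the metadata DB that a task the executor still
# holds in memory has in fact finished. The DSN and identifiers are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=airflow user=airflow host=postgres")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT dag_id, task_id, run_id, state, end_date
        FROM task_instance
        WHERE dag_id = %s AND task_id = %s
        ORDER BY end_date DESC
        """,
        ("example_dag", "example_task"),  # placeholder dag/task ids
    )
    for row in cur.fetchall():
        print(row)
```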
Once the scheduler pod is restarted, the problem goes away, the metrics in the Grafana dashboard reset, and tasks start executing.
What you think should happen instead
Airflow's scheduler keeps track of currently running tasks and their state in memory, and in some cases that state is not getting cleared. Tasks that have completed should eventually be cleared from the `running` set in `KubernetesExecutor` once the worker pod exits.

How to reproduce
Beats me. Our initial assumption was that it was a DAG implementation issue and some particular DAG was misbehaving. But this problem has occurred with all sorts of DAGs, happens for scheduled and manual runs, and is sporadic. There seems to be some edge scenario that causes this to happen, but we are unable to nail it down any further.
Operating System
Debian GNU/ Linux 11 (bullseye)
Versions of Apache Airflow Providers
aiofiles==23.1.0
aiohttp==3.8.4
airflow-dbt>=0.4.0
airflow-exporter==1.5.3
anytree==2.8.0
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-http>=2.0.3
apache-airflow-providers-microsoft-mssql==2.1.3
apache-airflow-providers-snowflake>=4.0.4
apache-airflow-providers-hashicorp==3.3.0
apache-airflow-providers-cncf-kubernetes==5.2.2
apache-airflow>=2.2.3
asgiref==3.5.0
Authlib==0.15.5
dbt-snowflake==1.5.2
flatdict==4.0.1
hvac==0.11.2
jsonschema>=4.17.3
pandas==1.3.5
psycopg2-binary==2.9.3
pyOpenSSL==23.1.1
pysftp==0.2.9
pysmbclient==0.1.5
python-gnupg==0.5.0
PyYAML~=5.4.1
requests~=2.26.0
smbprotocol==1.9.0
snowflake-connector-python==3.0.4
snowflake-sqlalchemy==1.4.7
statsd==3.3.0
py7zr==0.20.5
Deployment
Official Apache Airflow Helm Chart
Deployment details
Airflow is deployed via Helm charts on EKS in AWS. There are two scheduler pods with `AIRFLOW__CORE__PARALLELISM` set to `10`.

Anything else
N/A
Are you willing to submit PR?
Code of Conduct