Task stuck in "scheduled" or "queued" state, pool has all slots queued, nothing is executing #13542
Comments
@Overbryd I'm not sure if this helps, but I was dealing with a similar issue with our self-managed Airflow instance on GKE when we upgraded to 2.0.0 a couple of weeks ago. Are you using git-sync? If so, we found that updating the pod-template-file allowed gitSync to work again. If that isn't the issue, the other area you may want to look into is making sure that your service account binding annotations are properly set for your scheduler, webserver, and workers.
@Overbryd Did the suggestion in the above comment help?
We are experiencing these symptoms as well. Does anyone have a clue why this is happening?
@kaxil First of all, sorry for my late reply. I am still active on this issue, just so you know; I have been quite busy, unfortunately. You asked whether this might be git-sync related. It's a bit hard to track down precisely, and I could only ever see it when using the KubernetesExecutor. The only way I could ever trigger it manually (as stated in my original post) was by clearing a large number of tasks at once.
I'm seeing this behaviour as well. I could not reliably reproduce it, but my experience matches that of @Overbryd - especially when clearing many tasks that can run in parallel. I noticed that the indefinitely queued tasks produce an error in the scheduler log:
Looking further at the log, I notice that the task gets processed normally at first, but then gets picked up again, leading to the error mentioned above.
This is happening with the Celery executor as well.
This is probably not helpful because you are on 2.x, but our solution was to set AIRFLOW__SCHEDULER__RUN_DURATION, which restarts the scheduler every x hours. You could probably achieve something similar, though.
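For context, run_duration was an Airflow 1.10-era scheduler option (it no longer exists in 2.x), which is presumably why the commenter hedges. A sketch of what that looked like, with the six-hour value purely illustrative:

```
# Airflow 1.10.x only -- this option was removed in Airflow 2.0.
# Make the scheduler exit after 6 hours so the process supervisor restarts it.
AIRFLOW__SCHEDULER__RUN_DURATION=21600
```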
I had the same problem today and I think I found the cause. I was testing one DAG, and after changing a few parameters in one of the tasks in the DAG file and clearing the tasks, the task got stuck in the scheduled state.
So, the worker was refusing to execute it because I was passing an invalid argument to the task. The problem is that the worker doesn't notify the scheduler/webserver (or update the task status to running) that the file is wrong, and no broken-DAG alert was shown on the Airflow home page. After fixing the task parameter and clearing the task, it ran successfully. P.S.: This is probably not the same problem the OP is having, but it is related to tasks being stuck in the scheduled state.
I'm facing the same issue as the OP, and unfortunately what @renanleme said does not apply to my situation.
I wouldn't mind restarting the scheduler, but the reason for the hanging queued tasks is not clear to me. In my environment, it appears to be very random.
Back on this. I am currently observing the behaviour again, and I can confirm that the issue persists. The issue is definitely "critical", as it halts the entire Airflow operation.
Is this related to #14924?
I have replicated this; I will be working on it.
Unassigning myself as I can't reproduce the bug again. |
I ran into this issue due to the scheduler over-utilizing CPU because of one of our scheduler settings. The behavior I observed was that the scheduler would mark tasks as "queued" but never actually send them to the queue. I think the scheduler does the actual queueing via the executor, so I suspect that the executor is starved of resources and unable to queue the new tasks. I suspect the OP may have this same issue, because they mentioned having 100% CPU utilization on their scheduler.
We have changed the default value for that setting. For an existing deployment, a user will have to change this manually in their configuration. I wouldn't expect it to be the cause of tasks staying in the queued state, but it is worth a try.
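The setting referenced above isn't named in this copy of the thread; assuming it refers to [scheduler] min_file_process_interval (whose default was raised from 0 to 30 seconds around Airflow 2.0.1 so the scheduler stops re-parsing DAG files in a tight loop), the manual override for an existing deployment would look roughly like this:

```
# Assumed setting -- the exact option isn't captured in the comment above.
# Re-parse each DAG file at most every 30 seconds instead of continuously.
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=30
```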
I had a task stuck because I had changed the DAG id as part of my code change. After reverting the DAG id change, things worked fine. However, the old queued tasks had to be manually marked complete.
@kaxil I have checked. @SalmonTimo I have pretty high CPU utilisation (60%), albeit with default scheduler settings. But why? Does this matter? –– Same issue, new day: I have Airflow running and the scheduler running, but the whole cluster has 103 scheduled tasks and 3 queued tasks, and nothing is running at all. I highly doubt that is correct. What we need here is some factual inspection of the Python process. Following that stack-trace idea, I just learned that Python cannot dump a process on its own (https://stackoverflow.com/a/141826/128351); otherwise I would have provided you with such a dump of my running "scheduler". I am very happy to provide you with facts about my stalled scheduler if you tell me how you would debug such an issue.
What I find striking is the scheduler message claiming the pool has no open slots.
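As an aside on the debugging question raised above: the Python stacks of a live scheduler can be sampled from outside the process with a third-party tool such as py-spy (a suggestion, not something used elsewhere in this thread). Assuming py-spy is installed in the scheduler's environment:

```bash
# Find the scheduler PID, then take a one-shot dump of every thread's Python stack.
ps aux | grep "airflow scheduler"
py-spy dump --pid <scheduler-pid>

# Or watch where the scheduler is spending its time.
py-spy top --pid <scheduler-pid>
```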
@lukas-at-harren -- Can you check the Airflow Webserver -> Admin -> Pools page and, in the row for your pool, look at the task instances counted against the running slots? It is possible that they are not actually running but somehow got into that state in the DB. If you see 3 entries there, please mark those tasks as success or failed; that should clear your pool.
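If the UI is not an option, a similar check can be done directly against the metadata database. A minimal sketch using Airflow's ORM (assumes an Airflow 2.x environment with access to the metadata DB; the pool name is the default one from this issue):

```python
# List task instances currently counted against a pool's slots.
from airflow.models import TaskInstance
from airflow.utils.session import create_session
from airflow.utils.state import State

with create_session() as session:
    occupying = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.pool == "default_pool",
            TaskInstance.state.in_([State.QUEUED, State.RUNNING]),
        )
        .all()
    )
    for ti in occupying:
        print(ti.dag_id, ti.task_id, ti.execution_date, ti.state)
```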
Wow @potiuk I totally missed that update! Huge news! I'll check that out and see if it helps. |
@danmactough Did this update help? We're tracking this issue before upgrading.
@WattsInABox Unfortunately, the AWS MWAA "upgrade" path is not an upgrade path, but rather a "stand up a new cluster and move all your DAGs with no way to retain historical Airflow metadata" path. Which, OK, fine, maybe that's what we should do in any case, but it does mean testing out 2.2.2 is going to require some planning.
@danmactough ouch! Thanks for responding to a rando, we appreciate it :)
We've upgraded to a 2.2.2 MWAA environment and are encountering similar queuing behavior. Tasks remain in the queued state for about fifteen minutes before executing. This is in an extremely small dev environment we're testing. Unfortunately, even the unstick_tag task remains in a queued state.
@DVerzal I propose you follow https://airflow.apache.org/docs/apache-airflow/stable/faq.html?highlight=faq#why-is-task-not-getting-scheduled - review the resources and configuration of your Airflow, and open a new issue (if this does not help) with detailed information on your configuration and logs. It is likely a different problem than what you describe.
I also suggest opening an issue with MWAA support - maybe this is simply some problem with the MWAA configuration.
Thanks for pointing me in the right direction, @potiuk. We're planning to continue our investigation when some resources free up for the migration.
Hello, I found the same issue when I used version 2.2.4 (latest).
@haninp - this might be (and likely is, because MWAA, which plays a role here, has no 2.2.4 support yet) a completely different issue. It's not helpful to say "I also have a similar problem" without specifying details or logs. As a "workaround" (or diagnosis) I suggest you follow the FAQ here: https://airflow.apache.org/docs/apache-airflow/stable/faq.html?highlight=faq#why-is-task-not-getting-scheduled and double-check whether your problem is one of those with the configuration explained there. If you find you still have a problem, then I invite you to describe it in detail in a separate issue (if it is easily reproducible) or a GitHub Discussion (if you have a problem but are unsure how to reproduce it). Providing details such as your deployment, logs, circumstances, etc. is crucial to being able to help you. Just stating "I also have this problem" helps no one (including yourself, because you might think you have delegated the problem and it will be solved, when in fact it might be a completely different problem).
Hi all! With release 2.2.5, the scheduling issues have gone away for me.
I am still using mostly SubDAGs instead of TaskGroups, since the latter make the tree view incomprehensible. If you have a similar setup, give the 2.2.5 release a try!
For those having problems with MWAA: I had this error today and couldn't wait for the 2.2.5 release in MWAA to finish my company's migration project. We have 17 DAGs, with a median of 8-9 tasks each (one has 100+ tasks: for each table in our DB it runs an import/CDC task and a validation task), and all 17 run once a day at night. I went a bit extreme in reducing the load on the scheduler, and it looks like it's working properly (for our current use cases and scale) after a few tests today. If anyone wants to experiment and is having the same problem with similar settings, here are the configurations I've changed (a sketch follows below).
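The commenter's exact list was not captured in this copy of the thread. As an illustration only, these are standard Airflow options commonly tuned (via MWAA configuration overrides) to reduce scheduler load; the values are examples, not recommendations:

```
# Illustrative values -- tune for your own environment.
scheduler.min_file_process_interval = 60
scheduler.dag_dir_list_interval = 300
scheduler.parsing_processes = 2
core.dag_file_processor_timeout = 150
```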
#21455 (comment)
Also experiencing this issue on MWAA 2.2.2. We are seeing the same pattern as another commenter: when a DAG gets "stuck" with the first task in a queued state, it takes 15 minutes to sort itself out. Our MWAA 2.0.2 instances never exhibited this behavior. Has anyone had any luck finding a workaround/fix suitable for an MWAA 2.2.2 mw1.small instance (i.e. something that doesn't involve upgrading to a later Airflow version)? UPDATE: for anyone using MWAA v2.2.2 who is experiencing tasks stuck in a "queued" state for 15 minutes even when the worker pool has no tasks executing, what has worked for us is setting the "celery.pool" configuration option to "solo". This resolved the issue for us immediately, though it may have some knock-on impact on worker throughput, so you may need to scale workers accordingly in some situations.
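For reference, a sketch of that override as an MWAA configuration option (the dotted key maps to Airflow's [celery] pool setting; "solo" trades per-worker concurrency for predictability, which is why worker scaling may need adjusting):

```
# MWAA configuration override; may reduce per-worker throughput.
celery.pool = solo
```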
Taking @val2k's script and changing max_tries to 0 and the state to None fixed the script for us (a sketch of the approach follows below).
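The referenced script isn't reproduced in this thread. Below is a minimal sketch of the general idea under stated assumptions (Airflow 2.x, direct access to the metadata DB, a purely illustrative 30-minute threshold); anything like this should be tested carefully before being pointed at a production database:

```python
# Reset task instances that have been sitting in QUEUED for too long,
# so the scheduler will pick them up again.
from datetime import timedelta

from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.session import create_session
from airflow.utils.state import State

STUCK_FOR = timedelta(minutes=30)  # illustrative threshold

with create_session() as session:
    cutoff = timezone.utcnow() - STUCK_FOR
    stuck = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.QUEUED, TaskInstance.queued_dttm < cutoff)
        .all()
    )
    for ti in stuck:
        ti.state = None    # back to "no status" so the task is rescheduled
        ti.max_tries = 0   # mirrors the tweak described in the comment above
```

(create_session commits the transaction when the block exits.)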
Question for other MWAA users: have you tried setting max-workers == min-workers, basically disabling autoscaling? Is anyone without autoscaling actually seeing this, regardless of Airflow version?

We've also talked to the MWAA team and haven't heard clear answers about whether messages/workers are properly drained when downscaling, so I'm wondering if that is the crux of this issue: queue state becoming inconsistent due to race conditions with improper worker shutdown. As the MWAA backend is pretty opaque to end users, it's possible that downscaling is nothing more complicated or careful than just terminating an EC2 worker, or a Fargate pod, or whatever. However, I don't know much about Airflow/Celery internals as far as redelivery, dead-letter queues, etc., so I might be way off base here.

Since this is something that arguably could/should be fixed in a few different places (the MWAA core infrastructure, the Celery codebase, or the Airflow codebase), it seems likely that the problem may stick around for a while, along with the confusion about which versions are affected. The utility DAGs in this thread are an awesome reference ❤️, and it may come to that, but I'm still hoping for a different workaround. Airflow version upgrades would also leave us with a big stack of things to migrate, and we can't jump into that immediately. Without autoscaling we can expect things to get more expensive, but we're thinking it may be worth it at this point to buy more stability. Anyone got more info?
@mattvonrocketstein In our case we dynamically create DAGs, so the MWAA team's first suggestion was to reduce the load on the scheduler by increasing the DAG refresh interval. It seems to help: we see fewer errors in the logs and tasks get stuck less often, but it didn't resolve the issue. Now we are waiting for the second round of suggestions.
Specific to MWAA, our team had a similar issue. We found it to be a problem with the MWAA default value for how many tasks a worker will take on concurrently; workers have a different number of vCPUs per environment class, and the defaults scale with it.
Our resolution was to lower that value to something manageable for each environment class.
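The specific setting and the per-class default table were not captured above. Assuming the setting in question is celery.worker_autoscale (the MWAA override that caps concurrent tasks per worker, in "max,min" form), a lowered override might look like this, with the numbers purely illustrative:

```
# Assumed setting and illustrative values: cap each worker at 4 concurrent tasks.
celery.worker_autoscale = 4,4
```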
This issue was later moved to a GitHub discussion, where the conversation continues. The original report follows below.
Apache Airflow version: 2.0.0
Kubernetes version (if you are using kubernetes) (use kubectl version):
Environment (uname -a):
What happened:
Airflow with the KubernetesExecutor has many tasks stuck in the "scheduled" or "queued" state which never get resolved. The cluster uses the default_pool of 16 slots. The scheduler keeps logging:
('Not scheduling since there are %s open slots in pool %s and require %s pool slots', 0, 'default_pool', 1)
That is simply not true, because there is nothing running on the cluster and there are always 16 tasks stuck in "queued". Another message that shows up is:
Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
That is also not true. Nothing is running on the cluster and Airflow is likely just lying to itself. It seems the KubernetesExecutor and the scheduler easily go out of sync.
What you expected to happen:
How to reproduce it:
Vanilla Airflow 2.0.0 with the KubernetesExecutor on Python 3.7.9 (requirements.txt). The only reliable way to trigger the weird bug is to clear the task state of many tasks at once (> 300 tasks).
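For anyone trying to reproduce that mass clear, it can be issued from the Airflow 2.x CLI; the DAG id and date range below are placeholders:

```bash
# Clear a large window of task instances in one go (placeholder dag_id and dates).
airflow tasks clear my_big_dag -s 2020-12-01 -e 2021-01-10 --yes
```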
Anything else we need to know:
I don't know; as always, I am happy to help debug this problem.
The scheduler/executor seems to go out of sync with the state of the world and never gets back in sync.
We actually planned to scale up our Airflow installation to many more simultaneous tasks. With these severe yet basic scheduling/queuing problems, we cannot move forward at all.
Another strange, likely unrelated observation: the scheduler always uses 100% of the CPU, burning it. Even with no scheduled or queued tasks, it is always very, very busy.
Workaround:
The only workaround for this problem I could find so far is to manually go in, find all tasks in the "queued" state, and clear them all at once. Without that, the whole cluster/Airflow just stays stuck as it is.