Task stuck in upstream_failed #18011
Comments
I just recently upgraded to 2.1.3. In some cases the upstream tasks succeed, but the downstream tasks are stuck in an upstream_failed state.
Following the discussion at #17819, did you apply the fix described there in your DAG above?
@Tonkonozhenko Do you have
We're also seeing this. Most of our DAGs have well under 100 tasks, a few have just under 200 tasks, and we run 673 active DAGs and 179 paused DAGs. We started seeing this after upgrading to 2.1.3, which we upgraded to specifically to get the bug fix in PR #16301; not sure if that bug might be related, since we seem to be having weird status issues all over Airflow. We see this in all manner of DAGs: some with a very linear path, some that branch into 100 tasks and then back to 1, others with 2 prerequisite tasks feeding into the final task.
Please advise on what other information we can provide.
@WattsInABox, if you can get scheduler logs when this happens, that would be very helpful.
@ephraimbuddy, @WattsInABox explained perfectly what happens. We have exactly the same situation.
@Tonkonozhenko @WattsInABox Do you see any errors in the scheduler logs when this happens? If there are reproducible steps, please share them.
@ephraimbuddy unfortunately, I don't have 2.1.3 logs right now, but on 2.1.2 there were no such errors and no fatal errors at all.
Trying to get to a reproducible step here... Is there an existing "unit" test (or could you help me write one) for this scenario, so we can then check whether the failure & retry handlers do what I think they're doing?
Hi @ephraimbuddy - I work with @WattsInABox. We don't see errors in the scheduler logs, but we do see database connection errors when tasks heartbeat.
This causes the job to be SIGTERM'ed (most of the time, it seems). The tasks will now retry since we have #16301, and will eventually succeed. Sometimes a task is SIGTERM'ed 5 times or more before success, which is not ideal for tasks that take an hour plus. I suspect this also at times results in the downstream tasks being set to upstream_failed when in fact the upstream is all successful, but I can't prove it.

Our SQLAlchemy pool size is 350. This might be high, but my understanding is that the pool does not create connections until they are needed, and according to AWS monitoring the max connections we ever hit at peak time is ~300-370, which should be totally manageable on our instance.

Do you have any additional advice on things to try?
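(Editor's sketch, not part of the original thread: a minimal illustration of the SQLAlchemy pool settings being discussed, assuming a Postgres metadata DB. The URL and numbers are placeholders; Airflow builds its own engine from `sql_alchemy_conn`, `sql_alchemy_pool_size`, etc., rather than from a hand-built engine like this.)

```python
# Illustrative only: how SQLAlchemy pool sizing and liveness checks look in code.
# The connection URL and numbers below are placeholders, not values from this thread.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://airflow:airflow@metadata-db:5432/airflow",  # placeholder URL
    pool_size=10,          # connections kept per process; opened lazily, only when needed
    max_overflow=5,        # extra connections allowed under burst load
    pool_pre_ping=True,    # test each connection before handing it out, discarding dead ones
    pool_recycle=1800,     # recycle connections older than 30 minutes
)

# Each checkout comes from the pool; a connection is only created on first use.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```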
It's not supposed to set B to upstream_failed if A has retries. What I believe happened is that the executor reported that A had failed while A was still queued in the scheduler. Currently, A is failed directly in that case, which we are trying to fix in #17819. You can temporarily apply a patch that removes these two lines (airflow/jobs/scheduler_job.py, lines 654 to 655 at commit 2b80c1e) and wait for #17819 to be fixed.
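(Editor's sketch, not from the thread: a minimal DAG showing the A → B dependency being described. With retries set on A, the expectation discussed above is that a single reported failure of A puts it into up_for_retry rather than immediately marking B as upstream_failed. The dag_id, task ids, and bash commands are placeholders.)

```python
# Illustrative DAG: B depends on A, and A is allowed to retry.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="retry_example",                     # placeholder dag_id
    start_date=datetime(2021, 9, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "retries": 3,                           # A should be retried instead of failing outright
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    task_a = BashOperator(task_id="A", bash_command="echo upstream")
    task_b = BashOperator(task_id="B", bash_command="echo downstream")

    task_a >> task_b   # B should only run once A eventually succeeds
```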
In 2.1.4 we added some limits to the number of queued dagruns the scheduler can create, and I suspect the issue we have with database connections will go away with it.
@taylorfinnell, I suggest you increase the value of the worker_pods_pending_timeout configuration option. I'm not sure it'll resolve this, but it's also connected with SIGTERM being sent to the task runner, because pods are deleted by it.
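(Editor's note, not from the thread: a small sketch of how a setting like this is typically overridden, using Airflow's standard AIRFLOW__{SECTION}__{KEY} environment-variable convention. The [kubernetes] section name and the 600-second value are assumptions for illustration.)

```python
# Illustration of the env-var form of an airflow.cfg option, assumed here to live in
# the [kubernetes] section; it must be set in the environment before Airflow starts.
import os

# Equivalent to worker_pods_pending_timeout = 600 in airflow.cfg; 600 is an arbitrary example.
os.environ["AIRFLOW__KUBERNETES__WORKER_PODS_PENDING_TIMEOUT"] = "600"
```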
Thanks! It seems to me that setting is specific to the k8s executor - but we are using the CeleryExecutor.
That ERROR basically says it can't connect to the metadata DB -- where do you have your metadata DB?
Our metadata DB is in AWS and is a db.4xlarge that mostly looks like it's chilling out doing nothing every day. The most action we see is spikes to 350 connections (there's enough RAM for 1750 connections). We're working on figuring out whether the spikes are causing issues, but IMHO Airflow should not be falling over in the heartbeats because of a first-time missed connection. There should be some intelligent retry logic in the heartbeats...
Indeed, we do have retries in a few places; this might not be one of them and may need improving. Does this error occur without those network blips / DB connectivity issues? Can someone comment with steps to reproduce, please?
Actually, I do not agree with that statement. Airflow should be able to rely on the metadata database being available at all times, and losing connectivity in the middle of a transaction should not be handled by Airflow. That adds terrible complexity to your code and IMHO is not needed to deal with this kind of (apparent) connectivity instability - especially as this is a timeout while trying to connect to the database. At the SQLAlchemy/ORM level we often do not have control over when a session and connection are going to be established, and trying to handle all such failures at the application level is complex AND also not needed - especially in the case of Postgres.

For quite some time (and also in our Helm Chart) we have recommended that everyone using Postgres put PGBouncer in front of their Postgres database as a proxy. It also deals nicely with the number of open connections (Postgres is not good at handling many parallel connections - its connection model is process-based and thus resource-hungry when many connections are open). PGBouncer does not only manage connection pools shared between components, it also lets you react to exactly these kinds of network conditions: first of all, it reuses existing connections, so there are far fewer connection open/close events between PGBouncer and the database. All the connections opened by Airflow go to the locally available PGBouncer, which makes them totally resilient to networking issues. PGBouncer then handles the errors, and you can fine-tune it if you have connectivity problems to your database.

@WattsInABox - can you please add PGBouncer(s) to your deployment and let us know if that improves the situation? I think this is not even a workaround - it's actually a good solution (which we generally recommend for any deployment with Postgres). I will convert this into a discussion until we hear back from you - with your experiences with PGBouncer, whether those problems still occur after you get PGBouncer running, and hopefully with a reproducible case.
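(Editor's sketch, not from the thread: what pointing Airflow's SQLAlchemy connection at a local PGBouncer might look like. The host, port, database name, and credentials are placeholders, and the assumption is that PGBouncer listens on 127.0.0.1:6432 and proxies to the real Postgres instance.)

```python
# Illustrative only: Airflow talks to PGBouncer on localhost, and PGBouncer owns the
# actual (pooled, reused) connections to Postgres. All values below are placeholders.
import os

# Same shape of URL Airflow would normally point at Postgres directly, now aimed at
# PGBouncer; equivalent to setting sql_alchemy_conn in airflow.cfg.
os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN"] = (
    "postgresql+psycopg2://airflow:airflow@127.0.0.1:6432/airflow"
)
# Airflow's own sql_alchemy_pool_* settings may be worth revisiting afterwards, since
# PGBouncer is now the layer that manages connections to the database.
```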
This issue was moved to a discussion. You can continue the conversation there.
Apache Airflow version
2.1.3 (latest released)
Operating System
Debian
Versions of Apache Airflow Providers
No response
Deployment
Other Docker-based deployment
Deployment details
No response
What happened
Upstream tasks succeed, but the downstream tasks are stuck in the upstream_failed state.
What you expected to happen
The task should not be stuck in upstream_failed when its upstream tasks succeed.
How to reproduce
No response
Anything else
No response