Task stuck in upstream_failed #18011

Closed · 1 task done
Tonkonozhenko opened this issue Sep 3, 2021 · 19 comments
Labels
affected_version:2.1 · area:Scheduler · kind:bug

Comments

Tonkonozhenko (Contributor) commented Sep 3, 2021

Apache Airflow version

2.1.3 (latest released)

Operating System

Debian

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

No response

What happened

  1. Task prepare_timestamps failed with an unknown k8s-related issue (most likely the pod was killed).
  2. Task prepare_timestamps succeeded on the second try.
  3. The downstream task create_athena_partition is stuck in the upstream_failed state. We also experienced the same issue in 2.1.2, where the task got stuck in the queued state.

[screenshot of the DAG run attached]

What you expected to happen

The task should not be stuck

How to reproduce

No response

Anything else

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Tonkonozhenko added the area:core and kind:bug labels Sep 3, 2021
rpfernandezjr commented

I just recently upgraded to 2.1.3 and I'm seeing the same thing.

In some cases the upstream tasks succeed, but the downstream tasks are stuck in an upstream_failed state.

ephraimbuddy (Contributor) commented

Related: #16625
PR trying to address this: #17819

eladkal added the area:Scheduler and affected_version:2.1 labels and removed the area:core label Sep 4, 2021
ephraimbuddy (Contributor) commented

Following the discussions at #17819: you applied the fix there, then in your DAG above prepare_timestamps failed and retried, but create_athena_partition is stuck in up_for_retry?

ephraimbuddy (Contributor) commented

@Tonkonozhenko Do you have the wait_for_downstream arg set anywhere?

WattsInABox commented

We're also seeing this. Most of our DAGs have well under 100 tasks, a few have just under 200 tasks; we have 673 active DAGs and 179 paused DAGs. We do not use wait_for_downstream anywhere.

We started seeing this after upgrading to 2.1.3, which we upgraded to specifically to get the bug fix in PR #16301. Not sure if that bug might be related, since we seem to be having odd status issues all over Airflow...

We see this in all manner of DAGs: some with a very linear path, some that branch into 100 tasks and then back to 1, others with 2 prerequisite tasks feeding the final task.

Behavior:

  • upstream tasks all successful
  • downstream task(s) marked as upstream_failed
  • sometimes an upstream task will have a previous run marked as failed but then retries successfully, almost as if the downstream tasks get marked as upstream_failed on that run but then don't get cleared for the subsequent retry. But this does not always happen: we have seen multiple dag runs a night with upstream_failed tasks where all prior tasks worked on their first attempt (or at least only have logs for 1 attempt).

Please advise on what other information we can provide.

ephraimbuddy (Contributor) commented

@WattsInABox, if you can get scheduler logs from when this happens, that would be very helpful.

Tonkonozhenko (Contributor, Author) commented

@ephraimbuddy, @WattsInABox explained perfectly what happens. We have exactly the same situation.

ephraimbuddy (Contributor) commented Sep 17, 2021

@Tonkonozhenko @WattsInABox Do you see FATAL: sorry, too many clients already. in your scheduler logs when this happens?

If there are reproducible steps, please share them.

Tonkonozhenko (Contributor, Author) commented

@ephraimbuddy unfortunately, I don't have the 2.1.3 logs at the moment, but for 2.1.2 there is no such error and no fatal errors at all.

WattsInABox commented

Trying to get to a reproducible step here...

Is there an existing "unit" test (or could you help me write a unit test) for:

  1. A -> B dag
  2. A set to fail with retries more than 1

And then see if the failure & retry handlers do what I think they're doing (see the sketch after this list)? That is:

  1. A set to failed
  2. B set to upstream_failed
  3. A retries
  4. B is untouched
  5. A succeeds
  6. B left in upstream_failed
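
For reference, a rough sketch of the kind of DAG described above. The dag_id, task names, marker file, and retry settings are illustrative, not taken from the issue:

# Hypothetical repro DAG: task a fails on its first try, retries and succeeds;
# with the behaviour described in this thread, task b can end up stuck in upstream_failed.
import os
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

MARKER = "/tmp/upstream_failed_repro_marker"  # crude "fail only once" flag


def fail_first_attempt():
    if not os.path.exists(MARKER):
        open(MARKER, "w").close()
        raise RuntimeError("simulated failure on the first attempt")
    # second attempt succeeds


with DAG(
    dag_id="upstream_failed_repro",
    start_date=datetime(2021, 9, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    a = PythonOperator(
        task_id="a",
        python_callable=fail_first_attempt,
        retries=2,
        retry_delay=timedelta(seconds=30),
    )
    b = PythonOperator(task_id="b", python_callable=lambda: print("b ran"))

    a >> b

If the bug reproduces, b stays in upstream_failed even after a succeeds on its retry.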

taylorfinnell commented Sep 18, 2021

Hi @ephraimbuddy - I work with @WattsInABox. We don't see FATAL: sorry, too many clients already. but we do see:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.8/site-packages/airflow/jobs/base_job.py", line 202, in heartbeat
    session.merge(self)
  File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 2166, in merge
    return self._merge(
  File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 2244, in _merge
    merged = self.query(mapper.class_).get(key[1])
  File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/query.py", line 1018, in get
    return self._get_impl(ident, loading.load_on_pk_identity)

....

psycopg2.OperationalError: could not connect to server: Connection timed out

This causes the job to be SIGTERM'ed (most of the time, it seems). The tasks will now retry since we have #16301 and will eventually succeed. Sometimes a task is SIGTERM'ed 5 times or more before success, which is not ideal for tasks that take an hour plus. I suspect this also at times results in the downstream tasks being set to upstream_failed when in fact the upstream tasks are all successful, but I can't prove it.

We tried bumping AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC to 60 to ease up on hitting the database, with no luck. This error also happens when only a couple of DAGs are running, so there is not much load on our nodes or the database. We don't think it's a networking issue.

Our SQLAlchemy pool size is 350, which might be high, but my understanding is that the pool does not create connections until they are needed, and according to AWS monitoring the max connections we ever hit at peak time is ~300-370, which should be totally manageable on our db.m6g.4xlarge instance. However, if it's a 350-connection pool for each worker and each worker opens lots of connections that then stay alive in the pool, perhaps we are exhausting PG memory.

Do you have any additional advice on things to try?
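
(For context, the pool and heartbeat settings mentioned above map to airflow.cfg options in 2.1.x; the values below simply mirror the numbers in this comment and are not recommendations:)

# airflow.cfg (Airflow 2.1.x) - illustrative values mirroring the comment above
[core]
# per-process SQLAlchemy pool; connections are created lazily, not up front
sql_alchemy_pool_size = 350
# extra connections allowed beyond the pool size
sql_alchemy_max_overflow = 10

[scheduler]
# the AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC bump mentioned above
job_heartbeat_sec = 60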

ephraimbuddy (Contributor) commented Sep 19, 2021

> Trying to get to a reproducible step here...
>
> Is there an existing "unit" test (or could you help me write a unit test) for:
>
>   1. A -> B dag
>   2. A set to fail with retries more than 1
>
> And then see if the failure & retry handlers do what I think they're doing? That is:
>
>   1. A set to failed
>   2. B set to upstream_failed
>   3. A retries
>   4. B is untouched
>   5. A succeeds
>   6. B left in upstream_failed

It's not supposed to set B to upstream_failed if A has retries. What I believe happened is that the executor reported that A had failed while A was still queued in the scheduler. Currently A is failed directly in that case, which we are trying to fix in #17819.

You can temporarily apply a patch that removes these two lines:

self.log.info('Setting task instance %s state to %s as reported by executor', ti, state)
ti.set_state(state)

and wait for #17819 to be fixed.
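
As a concrete illustration, the temporary patch would just comment those lines out. The exact location is an assumption on my part (believed to be in airflow/jobs/scheduler_job.py, _process_executor_events, in 2.1.x), so verify against your installed version before patching:

# airflow/jobs/scheduler_job.py (2.1.x), _process_executor_events - location assumed, verify locally.
# Temporary workaround: stop force-setting the task instance to the state reported
# by the executor, so the normal retry path can handle the failure.
#
# self.log.info('Setting task instance %s state to %s as reported by executor', ti, state)
# ti.set_state(state)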

EDIT
Since you're getting SIGTERMs, as explained by @taylorfinnell, this seems related to pending pod timeout deletion. Increase the worker_pods_pending_timeout interval.

ephraimbuddy (Contributor) commented

> Hi @ephraimbuddy - I work with @WattsInABox. We don't see FATAL: sorry, too many clients already. but we do see:
>
> Traceback (most recent call last):
>   File "/opt/app-root/lib64/python3.8/site-packages/airflow/jobs/base_job.py", line 202, in heartbeat
>     session.merge(self)
>   File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 2166, in merge
>     return self._merge(
>   File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 2244, in _merge
>     merged = self.query(mapper.class_).get(key[1])
>   File "/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/orm/query.py", line 1018, in get
>     return self._get_impl(ident, loading.load_on_pk_identity)
>
> ....
>
> psycopg2.OperationalError: could not connect to server: Connection timed out
>
> This causes the job to be SIGTERM'ed (most of the time, it seems). The tasks will now retry since we have #16301, and will eventually succeed. Sometimes it is SIGTERM'ed 5 times or more before success - which is not ideal for tasks that take an hour plus. I suspect also at times this results in the downstream tasks being set to upstream_failed when in fact the upstream is all successful - but I can't prove it.
>
> We tried to bump the AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC to 60 to maybe ease up on hitting the database with no luck. This error also happens when only a couple DAGs are running so there is not much load on our nodes or the database. We don't think it's a networking issue.
>
> Our pool sqlalchemy pool size is 350, this might be high - but my understanding is the pool does not create connections until they are needed, and according to AWS monitoring the max connections we ever hit at peak time is ~300-370 which should be totally manageable on our db.m6g.4xlarge instance. However, if it's a 350 pool for each worker and each worker opens tons of connections that are then alive in the pool - perhaps we are exhausting PG memory
>
> Do you have any additional advice on things to try?

In 2.1.4 we added some limits to the number of queued dagruns the scheduler can create, and I suspect the database connection issue will go away with it. I was getting the FATAL: sorry, too many clients already. DB error until queued dagruns were limited in PR #18065.

ephraimbuddy (Contributor) commented Sep 20, 2021

@taylorfinnell, I suggest you increase the value of the worker_pods_pending_timeout configuration. I'm not sure it will resolve this, but it's also connected with SIGTERM being sent to the task runner, because pods are deleted when they hit that timeout.
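
For reference, that option lives under the [kubernetes] section in 2.1.x (or the matching environment variable); the value below is illustrative:

# airflow.cfg (Airflow 2.1.x) - only relevant when task pods are launched on Kubernetes
[kubernetes]
# seconds a worker pod may stay Pending before it is deleted
worker_pods_pending_timeout = 600

# or, equivalently, as an environment variable:
# AIRFLOW__KUBERNETES__WORKER_PODS_PENDING_TIMEOUT=600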

taylorfinnell commented

Thanks! It seems to me that setting is specific to the k8s executor, but we are using the CeleryExecutor.

kaxil (Member) commented Sep 20, 2021

@Tonkonozhenko @taylorfinnell

> psycopg2.OperationalError: could not connect to server: Connection timed out

That error basically says Airflow can't connect to the metadata DB. Where is your metadata DB hosted?

WattsInABox commented

Our metadata DB is in AWS and is a db.4xlarge that mostly looks like it's sitting idle every day. The most action we see is spikes to 350 connections (there's enough RAM for 1750 connections). We're working on figuring out whether the spikes are causing issues, but IMHO Airflow should not be falling over in the heartbeats b/c of a first-time missed connection. There should be some intelligent retry logic in the heartbeats...

kaxil (Member) commented Sep 20, 2021

> Our metadata DB is in AWS and is a db.4xlarge that mostly looks like its chilling out doing nothing every day. The most action we see is spikes to 350 connections (there's enough RAM for 1750 connections). We're working on weeding out if the spikes are causing issues, but IMHO Airflow should not be falling over in the heartbeats b/c of a first-time missed connection. There should be some intelligent retry logic in the heartbeats...

Indeed, we do have retries in a few places; this might not be one of them and may need improving. Does this error occur without those network blips / DB connectivity issues?

Can someone comment steps to reproduce, please?

potiuk (Member) commented Sep 20, 2021

> IMHO Airflow should not be falling over in the heartbeats b/c of a first-time missed connection. There should be some intelligent retry logic in the heartbeats...

Actually I do not agree with that statement.

Airflow should rely on the metadata database being available at all times, and losing connectivity in the middle of a transaction should not be handled by Airflow. That adds terrible complexity to the code and IMHO is not needed to deal with this kind of (apparent) connectivity instability, especially since this is a timeout when trying to connect to the database. At the SQLAlchemy and ORM level we often have no control over when a session and connection is going to be established, and trying to handle all such failures at the application level is complex.

AND it is also not needed at the application level, especially in the case of Postgres. For quite some time (and also in our Helm Chart) we have recommended that everyone using Postgres put PGBouncer in front of their Postgres database as a proxy. It also deals nicely with the number of open connections (Postgres is not good at handling many parallel connections: its connection model is process-based and thus resource-hungry when many connections are open).

PGBouncer not only manages connection pools shared between components, it also lets you react to network conditions like these. First of all, it will reuse existing connections, so there will be far fewer connection open/close events between PGBouncer and the database. All the connections opened by Airflow will go to the locally available PGBouncer, which makes them much more resilient to networking issues. PGBouncer then handles the errors, and you can fine-tune it if you have connectivity problems to your database.
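
A minimal sketch of such a setup, purely illustrative - hostnames, ports, auth file, and pool sizes are assumptions, not recommended values:

; pgbouncer.ini (illustrative)
[databases]
; Airflow connects to "airflow" through PGBouncer; PGBouncer connects onward to Postgres/RDS
airflow = host=your-rds-host.amazonaws.com port=5432 dbname=airflow

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6543
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling shares a small number of real Postgres connections among many clients
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50

Airflow's sql_alchemy_conn would then point at PGBouncer rather than at Postgres directly, e.g. postgresql+psycopg2://user:pass@pgbouncer-host:6543/airflow.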

@WattsInABox - can you please add PGBouncer(s) to your deployment and let us know if that improves the situation? I think this is not even a workaround - it's actually a good solution (which we generally recommend for any deployment with Postgres).

I will convert this into a discussion until we hear back from you - with your experiences with PGBouncer, whether those problems still occur after you get PGBouncer running, and with some reproducible case.

@apache apache locked and limited conversation to collaborators Sep 20, 2021
@potiuk potiuk closed this as completed Sep 20, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
