Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with retry logic in response to certain database errors #40882

Open
2 tasks done
jmaicher opened this issue Jul 19, 2024 · 0 comments
Open
2 tasks done

Problem with retry logic in response to certain database errors #40882

jmaicher opened this issue Jul 19, 2024 · 0 comments
Labels
area:core area:MetaDB Meta Database related issues. kind:bug This is a clearly a bug

Comments

@jmaicher
Copy link
Contributor

jmaicher commented Jul 19, 2024

Apache Airflow version

2.9.3

If "Other Airflow 2 version" selected, which one?

No response

What happened?

We run airflow with an HA postgres deployment via patroni on k8s. In response to temporary network problems, for example, there may be a brief failover to the postgres follower, which is in read-only mode. When this happens, write transactions, typically heartbeat updates, fail for a brief period with:

sqlalchemy.exc.InternalError: (psycopg2.errors.ReadOnlySqlTransaction) cannot execute UPDATE in a read-only transaction

During such events, which typically resolve within a few seconds at the infrastructure level, we observe scheduler crashes and also failing tasks, primarily because heartbeat updates fail. Interestingly, the job do not fail due to the above error, but instead with:

Failed to execute job [...] for task [...] ((psycopg2.errors.InFailedSqlTransaction) current transaction is aborted, commands ignored until end of transaction block)

We traced the problem to the retry_db_transaction decorator, which issues retries for DBAPIError (see #19856, and OperationalError, which is already covered by the former), but only issues a rollback in response to OperationalError.

The ReadOnlySqlTransaction error in our case is wrapped in SQLAlchemy's InternalError, for which we retry, but do not rollback the transaction. So all retries and eventually the whole operation will fail with InFailedSqlTransaction, regardless of whether the original problem resolved at the infrastructure level or not.

What you think should happen instead?

The retry_db_transaction should recover correctly from these issues.

How to reproduce

Start airflow with postgres (e.g. via breeze) and briefly toggle default_transaction_read_only:

postgres=# ALTER SYSTEM SET default_transaction_read_only TO on;
postgres=# SELECT pg_reload_conf();

postgres=# ALTER SYSTEM SET default_transaction_read_only TO off;
postgres=# SELECT pg_reload_conf();

Operating System

debian 12 (bookworm)

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

Custom HA postgres deployment via patroni

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@jmaicher jmaicher added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jul 19, 2024
@dosubot dosubot bot added the area:MetaDB Meta Database related issues. label Jul 19, 2024
jmaicher added a commit to jmaicher/airflow that referenced this issue Jul 19, 2024
In apache#19856, we added `DBAPIError` besides `OperationalError` to the retry exception types, but did not change the `retry_db_transaction` decorator to rollback transaction after failures and before a retry.

In certain cases (see apache#40882), this is needed as otherwise all retries will fail when the current session/transaction was "poisened" by the initial error.
potiuk pushed a commit that referenced this issue Jul 19, 2024
In #19856, we added `DBAPIError` besides `OperationalError` to the retry exception types, but did not change the `retry_db_transaction` decorator to rollback transaction after failures and before a retry.

In certain cases (see #40882), this is needed as otherwise all retries will fail when the current session/transaction was "poisened" by the initial error.
@shahar1 shahar1 removed the needs-triage label for new issues that we didn't triage yet label Jul 19, 2024
romsharon98 pushed a commit to romsharon98/airflow that referenced this issue Jul 26, 2024
In apache#19856, we added `DBAPIError` besides `OperationalError` to the retry exception types, but did not change the `retry_db_transaction` decorator to rollback transaction after failures and before a retry.

In certain cases (see apache#40882), this is needed as otherwise all retries will fail when the current session/transaction was "poisened" by the initial error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core area:MetaDB Meta Database related issues. kind:bug This is a clearly a bug
Projects
None yet
Development

No branches or pull requests

2 participants