In apache#19856, we added `DBAPIError` besides `OperationalError` to the retry exception types, but did not change the `retry_db_transaction` decorator to roll back the transaction after a failure and before a retry.
In certain cases (see apache#40882), this is needed because otherwise all retries will fail when the current session/transaction was "poisoned" by the initial error.
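The fix described above can be sketched as follows. This is a simplified, hypothetical illustration (names and retry/backoff handling differ from the actual Airflow decorator): a retry wrapper that rolls back the session for any retried DBAPIError, not only OperationalError, so an aborted transaction cannot poison subsequent attempts.

```python
import logging
import time
from functools import wraps

from sqlalchemy.exc import DBAPIError

log = logging.getLogger(__name__)


def retry_db_transaction(retries=3, backoff=0.5):
    """Retry a DB operation on DBAPIError, rolling back the session first.

    Simplified sketch of the behavior discussed above, not the real Airflow
    implementation. The key point is calling session.rollback() for *any*
    retried DBAPIError (e.g. InternalError wrapping ReadOnlySqlTransaction),
    not only OperationalError; otherwise the aborted transaction makes every
    retry fail with InFailedSqlTransaction.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, session=None, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, session=session, **kwargs)
                except DBAPIError as err:
                    if session is not None:
                        session.rollback()  # clear the poisoned transaction
                    if attempt == retries:
                        raise
                    log.warning("DB error on attempt %d/%d: %s",
                                attempt, retries, err)
                    time.sleep(backoff * attempt)
        return wrapper
    return decorator
```

For reference, in SQLAlchemy's exception hierarchy `InternalError` is a subclass of `DBAPIError` (via `DatabaseError`) but not of `OperationalError`, which is why the current decorator retries it without rolling back.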
Apache Airflow version
2.9.3
If "Other Airflow 2 version" selected, which one?
No response
What happened?
We run airflow with an HA postgres deployment via patroni on k8s. In response to temporary network problems, for example, there may be a brief failover to the postgres follower, which is in read-only mode. When this happens, write transactions, typically heartbeat updates, fail for a brief period with:

sqlalchemy.exc.InternalError: (psycopg2.errors.ReadOnlySqlTransaction) cannot execute UPDATE in a read-only transaction

During such events, which typically resolve within a few seconds at the infrastructure level, we observe scheduler crashes and failing tasks, primarily because heartbeat updates fail. Interestingly, the jobs do not fail with the above error, but instead with:

Failed to execute job [...] for task [...] ((psycopg2.errors.InFailedSqlTransaction) current transaction is aborted, commands ignored until end of transaction block)

We traced the problem to the retry_db_transaction decorator, which issues retries for DBAPIError (see #19856; this also covers OperationalError, a subclass), but only issues a rollback in response to OperationalError. The ReadOnlySqlTransaction error in our case is wrapped in SQLAlchemy's InternalError, for which we retry but do not roll back the transaction. As a result, all retries, and eventually the whole operation, fail with InFailedSqlTransaction, regardless of whether the original problem has resolved at the infrastructure level.

What you think should happen instead?
The retry_db_transaction decorator should recover correctly from these issues.

How to reproduce

Start airflow with postgres (e.g. via breeze) and briefly toggle default_transaction_read_only:

postgres=# ALTER SYSTEM SET default_transaction_read_only TO on;
postgres=# SELECT pg_reload_conf();
postgres=# ALTER SYSTEM SET default_transaction_read_only TO off;
postgres=# SELECT pg_reload_conf();

Operating System
debian 12 (bookworm)
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Custom HA postgres deployment via patroni
Anything else?
No response
Are you willing to submit PR?
Code of Conduct