Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"deadlock detected" error in dag table updates #21947

Closed
1 of 2 tasks
imbaczek opened this issue Mar 2, 2022 · 6 comments
Closed
1 of 2 tasks

"deadlock detected" error in dag table updates #21947

imbaczek opened this issue Mar 2, 2022 · 6 comments
Labels
area:core kind:bug This is a clearly a bug

Comments

@imbaczek
Copy link
Contributor

imbaczek commented Mar 2, 2022

Apache Airflow version

2.0.2

What happened

2022-02-26 15:42:51 UTC [104192]: user=prod_airflow,db=prod_airflow,app=[unknown],client=10.80.166.208 STATEMENT: UPDATE dag SET next_dagrun='2022-02-26T15:32:50.428672+00:00'::timestamptz, next_dagrun_create_after='2022-02-26T15:42:50.428672+00:00'::timestamptz WHERE dag.dag_id = 'redacted_1'
2022-02-26 15:43:07 UTC [130186]: user=prod_airflow,db=prod_airflow,app=[unknown],client=10.80.166.208 ERROR: deadlock detected
2022-02-26 15:43:07 UTC [130186]: user=prod_airflow,db=prod_airflow,app=[unknown],client=10.80.166.208 DETAIL: Process 130186 waits for ShareLock on transaction 1728784371; blocked by process 130573.
Process 130573 waits for ShareLock on transaction 1728784356; blocked by process 130186.
Process 130186: UPDATE dag SET next_dagrun='2022-02-26T15:42:47.604016+00:00'::timestamptz, next_dagrun_create_after='2022-02-26T15:52:47.604016+00:00'::timestamptz WHERE dag.dag_id = 'redacted_1'
Process 130573: UPDATE dag SET last_parsed_time='2022-02-26T15:43:03.633626+00:00'::timestamptz, next_dagrun='2022-02-26T15:40:03.633649+00:00'::timestamptz WHERE dag.dag_id = 'redacted_2'
2022-02-26 15:43:07 UTC [130186]: user=prod_airflow,db=prod_airflow,app=[unknown],client=10.80.166.208 HINT: See server log for query details.
2022-02-26 15:43:07 UTC [130186]: user=prod_airflow,db=prod_airflow,app=[unknown],client=10.80.166.208 CONTEXT: while updating tuple (4,2) in relation "dag"
2022-02-26 15:43:07 UTC [130186]: user=prod_airflow,db=prod_airflow,app=[unknown],client=10.80.166.208 STATEMENT: UPDATE dag SET next_dagrun='2022-02-26T15:42:47.604016+00:00'::timestamptz, next_dagrun_create_after='2022-02-26T15:52:47.604016+00:00'::timestamptz WHERE dag.dag_id = 'redacted_1'

Likely places which conflict in the code:

orm_dag.last_parsed_time = timezone.utcnow()

self.next_dagrun_create_after = next_dagrun_info.run_after

What you expected to happen

No response

How to reproduce

We have 830 dags and this number keeps growing due to business requirements. No special steps to reproduce are required, it just happens every now and then, as would be expected. Increase fillfactor, perhaps?

Operating System

VERSION="20.04.3 LTS (Focal Fossa)"

Versions of Apache Airflow Providers

apache-airflow-providers-celery==1.0.1
apache-airflow-providers-ftp==1.0.1
apache-airflow-providers-http==1.1.0
apache-airflow-providers-imap==1.0.1
apache-airflow-providers-sqlite==1.0.1

Deployment

Other Docker-based deployment

Deployment details

$ docker --version
Docker version 19.03.5, build 633a0ea838

Postgres: psql (11.5 (Ubuntu 11.5-1.pgdg16.04+1))

Anything else

Multiple times a day.

Took remediation steps to reduce amount of rows on page:

prod_airflow=# alter table dag set (fillfactor = 10);
ALTER TABLE
prod_airflow=# vacuum full dag;
VACUUM

which reduced frequency to ~1/day.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@imbaczek imbaczek added area:core kind:bug This is a clearly a bug labels Mar 2, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Mar 2, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@imbaczek imbaczek changed the title "deadlock detected" error "deadlock detected" error in dag table updates Mar 2, 2022
@ephraimbuddy
Copy link
Contributor

I have seen this error with multi-scheduler setup. How many schedulers are you using?

@imbaczek
Copy link
Contributor Author

imbaczek commented Mar 4, 2022

Just one active scheduler at a time.

@ephraimbuddy
Copy link
Contributor

Deadlocks are fine except if they caused the scheduler to stop. From the log above, this happens in your database and does not stop your scheduler. Looks similar to this fix #19842

@potiuk
Copy link
Member

potiuk commented Mar 6, 2022

Yep. duplicate of #19842

@potiuk potiuk closed this as completed Mar 6, 2022
@farvour
Copy link

farvour commented Nov 16, 2022

Also seeing this in the same table with a massive number of DAGs (3k+). We are running multi-scheduler. Will #19842 resolve this? We're on 2.3.3 at this time.

sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 25726 waits for ShareLock on transaction 374820848; blocked by process 26495.
Process 26495 waits for ShareLock on transaction 374820791; blocked by process 25726.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (7,13) in relation "dag"

[SQL: UPDATE dag SET next_dagrun=%(next_dagrun)s, next_dagrun_data_interval_start=%(next_dagrun_data_interval_start)s, next_dagrun_data_interval_end=%(next_dagrun_data_interval_end)s, next_dagrun_create_after=%(next_dagrun_create_after)s WHERE dag.dag_id = %(dag_dag_id)s]
[parameters: {'next_dagrun': DateTime(2022, 11, 16, 1, 48, 0, tzinfo=Timezone('UTC')), 'next_dagrun_data_interval_start': DateTime(2022, 11, 16, 1, 48, 0, tzinfo=Timezone('UTC')), 'next_dagrun_data_interval_end': DateTime(2022, 11, 16, 1, 50, 0, tzinfo=Timezone('UTC')), 'next_dagrun_create_after': DateTime(2022, 11, 16, 1, 50, 0, tzinfo=Timezone('UTC')), 'dag_dag_id': '<redacted>'}]

By itself, the deadlocks don't feel like a problem, but when they throw an uncaught exception, the process exits. Since we're running the schedulers in k8s, it trips the restart counter quite a bit, and the more often it happens the higher that goes... not good.

Here's a snip of some of the stack trace. It got clipped since this isn't being piped into the termination log so it's capped at the k8s limit:

  File "/virtualenvs/airflow/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1389, in _execute_clauseelement
    ret = self._execute_context(
  File "/virtualenvs/airflow/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1748, in _execute_context
    self._handle_dbapi_exception(
  File "/virtualenvs/airflow/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1929, in _handle_dbapi_exception
    util.raise_(
  File "/virtualenvs/airflow/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/virtualenvs/airflow/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_context
    self.dialect.do_execute(
  File "/virtualenvs/airflow/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 716, in do_execute
    cursor.execute(statement, parameters)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core kind:bug This is a clearly a bug
Projects
None yet
Development

No branches or pull requests

4 participants