Airflow crashes with a psycopg2.errors.DeadlockDetected exception #19957
Comments
Do you happen to have one or multiple schedulers? |
Also - is there a chance to find out from the logs what the "other" query that the deadlock happened on was? Is it also possible that you have another person or process (either manually or via some direct DB access) accessing the DB (for example, some scripts that perform retention or similar)? |
(The reason I am asking: we receive a very low number of Postgres deadlock reports, because (unlike MySQL) Postgres usually shows "real" deadlocks, and we'd expect this kind of problem to appear more often among our users, so I am looking for any special circumstances you might have.) |
Hmm, it's difficult to retrieve this bit of information. I don't think there were multiple instances of the scheduler running, but I will keep this eventuality in mind, so thank you for the hint. Anyway, I'm not experiencing the problem anymore right now, maybe because I upgraded the system packages and rebooted. If this bug happens again I will update this thread with more info about the SQL query or possible multiple scheduler instances being run. |
Unfortunately, this keeps happening (after a couple of weeks during which it was running smoothly).
I just restarted it and these are the scheduler processes running: I launched it with
And this is the launching script:
Responding to your questions:
|
Limiting concurrency does not solve the issue. Even after reducing the number of concurrent tasks to 1, the exception is triggered after some time. |
Marked it provisionally for 2.2.4 in case we release it (it might go straight to 2.3.0, depending on how serious the issues there are / whether we have a fix and how close we are to 2.3.0). |
It seems that by dropping Airflow's database entirely and recreating it from scratch, the bug is not re-occurring. So it might have been something in Airflow's DB data. |
I take this back: it is actually still crashing, unfortunately. |
We are facing the very same problem with Postgres. Even though it is a different database, the stack trace shows the same line/method being called in the Airflow "layer" before moving to the concrete database class, as in #19832
I didn't have time to test, but there is a chance that #20030 "fixes" this one together with #19832. |
Just upgraded to Airflow 2.2.3, and unfortunately it keeps crashing as well |
I looked at the code and places where it can occur and I have a hypothesis on what could cause it. Could you please disable https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#schedule-after-task-execution and see if the problem still occurs? |
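(For reference, a sketch of how that option can be disabled, assuming a stock airflow.cfg deployment; the equivalent environment variable is shown as well.)

[scheduler]
# Disable the "mini-scheduler" run that each task performs right after it finishes.
schedule_after_task_execution = False

# Equivalent environment variable:
# AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=False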
To bump it up - just let us know @stablum @AmarEL if disabling the mini-scheduler works. I think this might be caused by the mini-scheduler in the task deadlocking with the "actual" scheduler while trying to lock the same rows in a different order. But before we attempt to analyse it in detail and fix it - since you have an easily reproducible and repeatable case - disable the mini-scheduler and see if it helps. If it does, it will help us narrow down the reason and possibly fix it. The only drawback of disabling the mini-scheduler is potentially slightly longer latency in scheduling subsequent tasks in the same DAG. |
cc: @ashb -> do you think this is a plausible hypothesis ^^ ? |
Yes, that seems plausible. |
I managed to run a large number of tasks without Airflow crashing, so changing that setting as you suggested did indeed help! Thank you! :) |
Cool! Thanks for confirming! Now we need to find the root cause and fix it! |
I am experiencing the same problem and have set |
I actually spent some time a few days ago looking at the mini-scheduler code, but I could not really find a flaw there. The fact that it did not help you indicates that my hypothesis was unfounded, unfortunately, and maybe the reason was different (and the fact that it worked for @stablum was mainly a coincidence or some side effect of that change). @dwiajik - it might also be that your case is a bit different - could you please report (maybe create a gist with a few examples of) some of the logs of your deadlocks? Ideally, if you could send us the logs of the failing scheduler and the corresponding logs of the Postgres server from the same time - I believe it will be much easier to investigate if we see a few examples - the server logs should tell us exactly which two queries deadlocked, and this should help us a lot. What we really need is something in the /var/lib/pgsql/data/pg_log/*.log; there should be entries at the time the deadlock happens that look like this:
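(For illustration only, not taken from any reporter's system: a server-side deadlock entry in a Postgres log usually looks roughly like the following. PIDs, transaction ids, and queries are placeholders; the per-process query lines in DETAIL are the important part.)

ERROR:  deadlock detected
DETAIL:  Process 11111 waits for ShareLock on transaction 22222; blocked by process 33333.
        Process 33333 waits for ShareLock on transaction 44444; blocked by process 11111.
        Process 11111: UPDATE task_instance SET state=... WHERE ...
        Process 33333: SELECT ... FROM task_instance ... FOR UPDATE
HINT:  See server log for query details.
CONTEXT:  while updating tuple (0,1) in relation "task_instance"
STATEMENT:  UPDATE task_instance SET state=... WHERE ...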
Ideally we need those and some logs around them, if possible. |
I am afraid we need to reopen this one. IMHO #20894 has no chance of fixing the problem because it does not really change Airflow behaviour (see the discussion in #20894). @dwiajik @stablum if you still experience this problem - I think we really need some server-side logs that will tell us what other query is deadlocking with this one. |
@dwiajik @stablum - is there any chance you have some customisations (plugins?) or users running DB operations (backfill? API calls? UI modifications) that might have caused the deadlock? Looking at the code, my intuition tells me that this must have been something external. Having the server logs could help to pinpoint it. |
The scheduler job performs scheduling after locking the "scheduled" DagRun row for writing. This should prevent the DagRun and related task instances from being modified by another scheduler or by the "mini-scheduler" run after a task completes. However, there is apparently one more case where the DagRun is locked by "Task" processes - namely when a task throws AirflowRescheduleException. In this case a new "TaskReschedule" entity is inserted into the database, and it also takes a lock on the DagRun (because TaskReschedule has a "DagRun" relationship). This PR modifies the handling of AirflowRescheduleException to obtain the very same DagRun lock before it attempts to insert the TaskReschedule entity. It seems that TaskReschedule is the only entity with this relationship, so likely all the mysterious SchedulerJob deadlock cases we experienced might be explained (and fixed) by this one. It is likely that this one: * Fixes: #16982 * Fixes: #19957 (cherry picked from commit 6d110b5)
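(For readers following along, a minimal, self-contained sketch of the lock-ordering idea that PR text describes. It is illustrative only: the table and model names are made up, the connection string is an assumption, it targets SQLAlchemy 1.4+ style, and it is not Airflow's actual implementation.)

# Illustrative sketch only: both code paths lock the parent dag_run row
# first (SELECT ... FOR UPDATE), so they always acquire row locks in the
# same order and cannot deadlock on each other.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class DagRunRow(Base):  # stand-in for Airflow's DagRun model
    __tablename__ = "demo_dag_run"
    id = Column(Integer, primary_key=True)
    state = Column(String)

class TaskRescheduleRow(Base):  # stand-in for TaskReschedule
    __tablename__ = "demo_task_reschedule"
    id = Column(Integer, primary_key=True)
    dag_run_id = Column(Integer, ForeignKey("demo_dag_run.id"))
    dag_run = relationship(DagRunRow)

# Assumed connection string for a throwaway demo database.
engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost/deadlock_demo")

def insert_reschedule(dag_run_id: int) -> None:
    """Task-side path: lock the DagRun row before inserting the child row."""
    with Session(engine) as session, session.begin():
        locked_run = (
            session.query(DagRunRow)
            .filter(DagRunRow.id == dag_run_id)
            .with_for_update()   # same lock the scheduler-side path takes first
            .one()
        )
        session.add(TaskRescheduleRow(dag_run=locked_run))

def schedule_tis(dag_run_id: int) -> None:
    """Scheduler-side path: also locks the DagRun row first."""
    with Session(engine) as session, session.begin():
        session.query(DagRunRow).filter(
            DagRunRow.id == dag_run_id
        ).with_for_update().one()
        # ... update the task instances belonging to the locked run here ...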
Unfortunately I'm still experiencing this bug with Airflow 2.2.4 (it's crashing every 5-10 mins):
I also set This is my PostgreSQL log:
Edit: still have the bug with 2.2.5 |
In your last log you have some long lines that seem to be truncated. For example: It would be nice to see the full query - what filters there are, etc. |
Here is the log of an occurrence of the crash, even after migrating to 2.2.5:
|
The Python logs always seem to show the update query in |
One thing I noticed is that the crashing query is particularly long, as I have several thousand tasks in this DAG. And since the query is 22 MB, the only way I have to "paste" it is via WeTransfer: https://we.tl/t-x8FM4tI0XR |
I wonder if the SQL IN statement is the one creating problems, as it seems anti-pattern-ish: would it be possible to avoid it by using a subquery and/or a JOIN, maybe? |
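(To make that suggestion concrete, a hedged sketch of the two query shapes. It uses Airflow's real TaskInstance model only for illustration, assumes SQLAlchemy 1.4+ style select(), uses a made-up filter condition, and is not how Airflow actually builds its query.)

# Illustration only: a huge literal IN (...) list versus a subquery the
# database evaluates itself. `session` is an active SQLAlchemy session.
from sqlalchemy import select
from airflow.models.taskinstance import TaskInstance as TI

def update_with_in_list(session, dag_id, run_id, task_ids, new_state):
    # Shape seen in the logs: every task_id becomes a bound parameter.
    session.query(TI).filter(
        TI.dag_id == dag_id,
        TI.run_id == run_id,
        TI.task_id.in_(task_ids),            # may contain thousands of ids
    ).update({TI.state: new_state}, synchronize_session=False)

def update_with_subquery(session, dag_id, run_id, new_state):
    # Alternative shape: a subquery picks the rows instead of shipping a list.
    eligible = (
        select(TI.task_id)
        .where(TI.dag_id == dag_id, TI.run_id == run_id)
        .where(TI.state == "queued")          # hypothetical filter condition
        .scalar_subquery()
    )
    session.query(TI).filter(
        TI.dag_id == dag_id,
        TI.run_id == run_id,
        TI.task_id.in_(eligible),
    ).update({TI.state: new_state}, synchronize_session=False)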
Hmm, actually the last recorded query starts with:
So I'm wondering if storing the serialization of such a huge DAG is creating problems. Maybe deactivating the serialization would prevent the issue? |
Hmm, it seems that the serialization is something that is done at a certain interval. Might it be that a serialization operation can get into conflict with the subsequent one if the first one is not completed? |
I will try to increase |
I found it:
|
Is the query even longer or is this it? It has no WHERE statement. |
Oh, my mistake, I used grep, but the query is multi-line. Here it is with some surrounding SQL context:
|
would it make sense that the |
Unfortunately it crashed again, and this time one of the deadlocking queries is the following:
|
The UPDATE statement seems to be constant, but I can't figure out where the SELECT statement is coming from. SQLAlchemy is probably obfuscating it a bit also. |
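(One generic way to see every statement SQLAlchemy emits, including the mystery SELECT, is to turn on engine-level logging. A minimal sketch, not Airflow-specific advice:)

# Minimal sketch: log every SQL statement SQLAlchemy sends, so the SELECT
# participating in the deadlock can be traced back to its origin.
import logging

logging.basicConfig()
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)  # use DEBUG for even more detail

On the server side, enabling log_lock_waits in postgresql.conf is another commonly used option for this kind of investigation.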
Does this still apply, or did it fail later on? |
It failed, and I'm not finding the I mean, after I wrote what you quoted (8th January) it began crashing again, and today I thought it was because of the |
In the Postgres log, the block that starts with |
Question - I am a bit lost about where we are currently, and since this issue is long and relates to another - likely fixed - problem, may I make a kind request: can someone create a new issue with the deadlock they are currently experiencing - with all details - including the log of the deadlock and logs from the server side from around the deadlock (and some details on how frequently/when it happens + all the usual versioning information)? I might finally get some time to take a closer look. |
Created a new issue to track this: Deadlock exception #23361 |
I am experiencing this issue in 2.5.1 when I try to change the state of 100 task instances of a DAG run through the REST API; I am calling the
|
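(The exact endpoint the commenter used is cut off above. Purely for illustration, bulk state changes are often scripted against the stable REST API roughly as below; the endpoint and field names are assumed from the Airflow 2.5 API docs, and the URL, credentials, DAG id, run id, and task ids are all made up.)

# Illustration only (assumed Airflow 2.5 stable REST API): patch the state
# of many task instances of one DAG run, one call per task instance.
import requests

BASE = "http://localhost:8080/api/v1"            # assumed webserver URL
AUTH = ("admin", "admin")                        # assumed basic-auth credentials
dag_id = "my_dag"                                # made-up DAG id
run_id = "manual__2023-01-01T00:00:00+00:00"     # made-up run id

for task_id in [f"task_{i}" for i in range(100)]:   # hypothetical task ids
    resp = requests.patch(
        f"{BASE}/dags/{dag_id}/dagRuns/{run_id}/taskInstances/{task_id}",
        json={"new_state": "success"},
        auth=AUTH,
    )
    resp.raise_for_status()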
For those still having this issue, I think you should open a separate issue after testing with the latest release, which is 2.6.1. |
Apache Airflow version
2.2.2 (latest released)
Operating System
Ubuntu 21.04 on a VM
Versions of Apache Airflow Providers
root@AI-Research:~/learning_sets/airflow# pip freeze | grep apache-airflow-providers
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-http==2.0.1
apache-airflow-providers-imap==2.0.1
apache-airflow-providers-sqlite==2.0.1
Deployment
Other
Deployment details
Airflow is at version 2.2.2
psql (PostgreSQL) 13.5 (Ubuntu 13.5-0ubuntu0.21.04.1)
The DAG contains thousands of tasks for data download, preprocessing, and preparation, with the results destined for a MongoDB database (so I'm not using PostgreSQL inside my tasks).
What happened
[2021-12-01 19:41:57,556] {scheduler_job.py:644} ERROR - Exception when executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
self.dialect.do_execute(
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/engine/default.py", line 608, in do_execute
cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL: Process 322086 waits for ShareLock on transaction 2391367; blocked by process 340345.
Process 340345 waits for AccessExclusiveLock on tuple (0,26) of relation 19255 of database 19096; blocked by process 340300.
Process 340300 waits for ShareLock on transaction 2391361; blocked by process 322086.
HINT: See server log for query details.
CONTEXT: while updating tuple (1335,10) in relation "task_instance"
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/airflow/jobs/scheduler_job.py", line 628, in _execute
self._run_scheduler_loop()
File "/usr/local/lib/python3.9/dist-packages/airflow/jobs/scheduler_job.py", line 709, in _run_scheduler_loop
num_queued_tis = self._do_scheduling(session)
File "/usr/local/lib/python3.9/dist-packages/airflow/jobs/scheduler_job.py", line 792, in _do_scheduling
callback_to_run = self._schedule_dag_run(dag_run, session)
File "/usr/local/lib/python3.9/dist-packages/airflow/jobs/scheduler_job.py", line 1049, in _schedule_dag_run
dag_run.schedule_tis(schedulable_tis, session)
File "/usr/local/lib/python3.9/dist-packages/airflow/utils/session.py", line 67, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/airflow/models/dagrun.py", line 898, in schedule_tis
session.query(TI)
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/orm/query.py", line 4063, in update
update_op.exec_()
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/orm/persistence.py", line 1697, in exec_
self._do_exec()
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/orm/persistence.py", line 1895, in _do_exec
self._execute_stmt(update_stmt)
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/orm/persistence.py", line 1702, in _execute_stmt
self.result = self.query._execute_crud(stmt, self.mapper)
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/orm/query.py", line 3568, in _execute_crud
return conn.execute(stmt, self._params)
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/engine/base.py", line 1011, in execute
return meth(self, multiparams, params)
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/engine/base.py", line 1124, in _execute_clauseelement
ret = self._execute_context(
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/engine/base.py", line 1316, in _execute_context
self._handle_dbapi_exception(
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/engine/base.py", line 1510, in _handle_dbapi_exception
util.raise_(
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/util/compat.py", line 182, in raise_
raise exception
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
self.dialect.do_execute(
File "/usr/local/lib/python3.9/dist-packages/sqlalchemy/engine/default.py", line 608, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL: Process 322086 waits for ShareLock on transaction 2391367; blocked by process 340345.
Process 340345 waits for AccessExclusiveLock on tuple (0,26) of relation 19255 of database 19096; blocked by process 340300.
Process 340300 waits for ShareLock on transaction 2391361; blocked by process 322086.
HINT: See server log for query details.
CONTEXT: while updating tuple (1335,10) in relation "task_instance"
[SQL: UPDATE task_instance SET state=%(state)s WHERE task_instance.dag_id = %(dag_id_1)s AND task_instance.run_id = %(run_id_1)s AND task_instance.task_id IN (%(task_id_1)s, %(task_id_2)s, %(task_id_3)s, %(task_id_4)s, %(task_id_5)s, %(task_id_6)s, %(task_id_7)s, %(task_id_8)s, %(task_id_9)s, %(task_id_10)s, %(task_id_11)s, %(task_id_12)s, %(task_id_13)s, %(task_id_14)s, %(task_id_15)s, %(task_id_16)s, %(task_id_17)s, %(task_id_18)s, %(task_id_19)s, %(task_id_20)s)]
[parameters: {'state': <TaskInstanceState.SCHEDULED: 'scheduled'>, 'dag_id_1': 'download_and_preprocess_sets', 'run_id_1': 'manual__2021-12-01T17:31:23.684597+00:00', 'task_id_1': 'download_1379', 'task_id_2': 'download_1438', 'task_id_3': 'download_1363', 'task_id_4': 'download_1368', 'task_id_5': 'download_138', 'task_id_6': 'download_1432', 'task_id_7': 'download_1435', 'task_id_8': 'download_1437', 'task_id_9': 'download_1439', 'task_id_10': 'download_1457', 'task_id_11': 'download_168', 'task_id_12': 'download_203', 'task_id_13': 'download_782', 'task_id_14': 'download_1430', 'task_id_15': 'download_1431', 'task_id_16': 'download_1436', 'task_id_17': 'download_167', 'task_id_18': 'download_174', 'task_id_19': 'download_205', 'task_id_20': 'download_1434'}]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
[2021-12-01 19:41:57,566] {local_executor.py:388} INFO - Shutting down LocalExecutor; waiting for running tasks to finish. Signal again if you don't want to wait.
[2021-12-01 19:42:18,013] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 285470
[2021-12-01 19:42:18,105] {process_utils.py:66} INFO - Process psutil.Process(pid=285470, status='terminated', exitcode=0, started='18:56:21') (285470) terminated with exit code 0
[2021-12-01 19:42:18,106] {scheduler_job.py:655} INFO - Exited execute loop
What you expected to happen
Maybe 24 concurrent processes/tasks are too many?
How to reproduce
Reproducibility is challenging, but maybe the exception provides enough info for a fix.
Anything else
It happens all the time, some time after the DAG starts running.
Are you willing to submit PR?
Code of Conduct