
Avoid deadlock when rescheduling task #21362

Merged: potiuk merged 1 commit into apache:main from avoid_deadlock_on_task_reschedule on Feb 7, 2022

Conversation

@potiuk (Member) commented Feb 6, 2022

The scheduler job performs scheduling after locking the "scheduled"
DagRun row for writing. This should prevent another scheduler or the
"mini-scheduler" run after task completion from modifying the DagRun
and its related task instances.

However, there is apparently one more case where the DagRun gets
locked by "Task" processes - namely when a task throws
AirflowRescheduleException. In this case a new "TaskReschedule"
entity is inserted into the database, and that insert also takes a
lock on the DagRun (because TaskReschedule has a "DagRun" relationship).

This PR modifies the handling of AirflowRescheduleException to obtain
the very same DagRun lock before it attempts to insert the
TaskReschedule entity.

TaskReschedule seems to be the only entity with this relationship, so
all the mysterious SchedulerJob deadlock cases we have seen might be
explained (and fixed) by this one change.

It is likely that this one:

* Fixes: apache#16982
* Fixes: apache#19957
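In code terms, the idea is roughly the following. This is a minimal sketch only, assuming SQLAlchemy's with_for_update(); the function name and the TaskReschedule constructor arguments are illustrative, not the exact diff in this PR:

import datetime

from sqlalchemy.orm import Session

from airflow.models.dagrun import DagRun
from airflow.models.taskreschedule import TaskReschedule


def record_reschedule(ti, reschedule_date: datetime.datetime, session: Session) -> None:
    # Take the same row lock on the DagRun that the scheduler holds while it
    # schedules, so both sides acquire locks in the same order and the
    # TaskReschedule INSERT below can no longer deadlock against the scheduler.
    session.query(DagRun).filter(
        DagRun.dag_id == ti.dag_id,
        DagRun.run_id == ti.run_id,
    ).with_for_update().one()

    # Only now insert the TaskReschedule row; its FK/relationship to DagRun is
    # what previously grabbed the dag_run lock in the "wrong" order.
    # (Constructor arguments are schematic.)
    session.add(
        TaskReschedule(
            ti.task,
            ti.run_id,
            ti.try_number,
            ti.start_date,
            ti.end_date,
            reschedule_date,
        )
    )
    session.flush()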

@potiuk requested review from ashb, kaxil and XD-DENG as code owners on February 6, 2022 14:56
@potiuk (Member, Author) commented Feb 6, 2022

@ashb @ephraimbuddy @uranusjr @BasPH - I think I have finally found - thanks to the "server log" provided by our user - the reason for the deadlock that has been plaguing some users (the nature of the scenario that causes it also explains why it is so rare and why only some users experience it). Please take a look.

The gist with the log that led me to this hypothesis/solution is here: #16982 (comment)

@potiuk (Member, Author) commented Feb 6, 2022

Hey @jedcunningham - if this one is confirmed to fix the deadlock issue, I think it is a very good candidate for 2.2.4 - it's very small and IMHO not at all risky (the worst that can happen is slightly slower rescheduling when AirflowRescheduleException is thrown), and it solves a really nasty edge case that cannot be worked around otherwise (if my hypothesis is confirmed, that is).

@potiuk (Member, Author) commented Feb 6, 2022

How I got to that:

Gist here: https://gist.github.com/tulanowski/fcc8358bad3c8e5d15678639ec015d8b

Query 1 (just a fragment of it) - this is the scheduler trying to get the task instances to consider for scheduling:

Failed to execute task (MySQLdb._exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id,
 task_instance.dag_id AS task_instance_dag_id, task_instance.run_id AS task_instance_run_id, task_instance.start_date AS
dag_run_1.dag_hash AS dag_run_1_dag_hash
FROM task_instance INNER JOIN dag_run AS dag_run_1 ON dag_run_1.dag_id = task_instance.dag_id AND dag_run_1.run_id = task_instance.run_id
WHERE task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.run_id = %s
 LIMIT %s FOR UPDATE]

It corresponds to this query (TRANSACTION 1 from the "server log"):

SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id,
task_instance.dag_id AS task_instance_dag_id, task_instance.run_id AS task_instance_run_id, task_instance.start_date 
AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS 
.... 

Query 2: This is the TaskReschedule insert, which only happens (as far as I checked) when AirflowRescheduleException is thrown during task execution (TRANSACTION 2 from the server log):

INSERT INTO task_reschedule (task_id, dag_id, run_id, try_number, start_date, end_date, duration, reschedule_date)
VALUES ('raw_sensor.raw_input_shop.ipad_orders_from_popup_store', 'dwh_core.main',
'scheduled__2022-01-27T00:15:00+00:00', 1, '2022-01-28 02:16:08.000687', '2022-01-28 02:16:08.142040', 0.141353e0, '2022-01-28 02:17:08.129220')

Both of them are waiting for this lock:

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 577 page no 2196 n bits 296 index dag_run_dag_id_run_id_key of table `airflow`.`dag_run` trx id 844106125 lock mode S locks rec but not gap waiting
Record lock, heap no 230 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 13; hex 6477685f636f72652e6d61696e; asc dwh_core.main;;
 1: len 30; hex 7363686564756c65645f5f323032322d30312d32375430303a31353a3030; asc scheduled__2022-01-27T00:15:00; (total 36 bytes);
 2: len 4; hex 800198fa; asc     ;;

How I understand this: it is a lock on the dag_run_dag_id_run_id_key index (the unique key on dag_id + run_id of dag_run), which has to be taken because we are inserting a row into task_reschedule; because of the "DagRun" relationship in the TaskReschedule object, that index entry needs to be locked whenever a TaskReschedule related to the same DagRun is inserted.
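To make the relationship concrete, here is a simplified, standalone sketch of the two tables involved. These are not the actual Airflow model definitions (columns are trimmed and types are approximate); the point is only that task_reschedule carries a composite foreign key to dag_run's (dag_id, run_id) unique key - the dag_run_dag_id_run_id_key index named in the lock record above - so validating that FK during the INSERT takes a shared lock on the matching dag_run index entry:

from sqlalchemy import (
    Column,
    ForeignKeyConstraint,
    Integer,
    String,
    TIMESTAMP,
    UniqueConstraint,
)
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class DagRun(Base):
    __tablename__ = "dag_run"
    id = Column(Integer, primary_key=True)
    dag_id = Column(String(250), nullable=False)
    run_id = Column(String(250), nullable=False)
    __table_args__ = (
        # The index named in the deadlock report above.
        UniqueConstraint("dag_id", "run_id", name="dag_run_dag_id_run_id_key"),
    )


class TaskReschedule(Base):
    __tablename__ = "task_reschedule"
    id = Column(Integer, primary_key=True)
    task_id = Column(String(250), nullable=False)
    dag_id = Column(String(250), nullable=False)
    run_id = Column(String(250), nullable=False)
    try_number = Column(Integer, nullable=False)
    reschedule_date = Column(TIMESTAMP)
    __table_args__ = (
        # Composite FK back to dag_run: checking it while inserting a
        # task_reschedule row takes a shared lock on the matching
        # dag_run_dag_id_run_id_key entry.
        ForeignKeyConstraint(
            ["dag_id", "run_id"],
            ["dag_run.dag_id", "dag_run.run_id"],
        ),
    )
    # The "DagRun" relationship mentioned above.
    dag_run = relationship(DagRun)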

So what I think happens:

  1. The scheduler locks the DagRun entity (but not the dag_run index!)
  2. The task throws AirflowRescheduleException, tries to insert the 'task_reschedule' row, and attempts to get the index lock (but cannot get it, because that also needs the DagRun entity lock, so it waits)
  3. The scheduler's task_instance SELECT ... FOR UPDATE then tries to get the index lock - but it can't, because the insert already holds it

Classic deadlock.
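Side note for anyone checking their own database: the "server log" excerpts above are InnoDB's latest-detected-deadlock report, which can be pulled roughly like this. A minimal sketch only - the connection string is an assumption, and MySQL with the PROCESS privilege is required:

from sqlalchemy import create_engine, text

engine = create_engine("mysql+mysqldb://airflow:airflow@localhost/airflow")  # assumed DSN

with engine.connect() as conn:
    # SHOW ENGINE INNODB STATUS returns a single (Type, Name, Status) row; the
    # Status text contains the "LATEST DETECTED DEADLOCK" section quoted above.
    type_, name, status = conn.execute(text("SHOW ENGINE INNODB STATUS")).one()
    print(status)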

@potiuk force-pushed the avoid_deadlock_on_task_reschedule branch from f9e5926 to c468a5e on February 6, 2022 15:41
@potiuk force-pushed the avoid_deadlock_on_task_reschedule branch from c468a5e to ca5d372 on February 6, 2022 15:44
@ephraim commented Feb 6, 2022

You most probably did not mean me. Ephraim... T.Ephraim perhaps, or so... please double check who you tag!

@potiuk (Member, Author) commented Feb 6, 2022

You most probably did not mean me. Ephraim... T.Ephraim perhaps, or so... please double check who you tag!

Yep, sorry. Happens. I meant @ephraimbuddy. Feel free to mute that discussion.

@potiuk (Member, Author) commented Feb 7, 2022

@ashb @ephraimbuddy @uranusjr @jedcunningham - a keen look at this might be useful; it would be cool to get this one in before 2.2.4 (if we have a consensus that this one looks like a plausible explanation + fix).

@ephraim commented Feb 7, 2022

You most probably did not mean me. Ephraim... T.Ephraim perhaps, or so... please double check who you tag!

Yep, sorry. Happens. I meant @ephraimbuddy. Feel free to mute that discussion.

Then please don't add me again by still using @ephraim!

@potiuk (Member, Author) commented Feb 7, 2022

Then please don't add me again by still using @ephraim!

Ah sorry. Really. Copy & paste. Really sorry!

github-actions bot commented Feb 7, 2022

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

github-actions bot added the "full tests needed" label (We need to run full set of tests for this PR to merge) on Feb 7, 2022
@potiuk merged commit 6d110b5 into apache:main on Feb 7, 2022
@potiuk deleted the avoid_deadlock_on_task_reschedule branch on February 7, 2022 19:12
@jedcunningham added the "type:bug-fix" label (Changelog: Bug Fixes) on Feb 8, 2022
jedcunningham pushed a commit that referenced this pull request Feb 8, 2022
(cherry picked from commit 6d110b5)
jedcunningham pushed a commit that referenced this pull request Feb 10, 2022
(cherry picked from commit 6d110b5)
jedcunningham pushed a commit that referenced this pull request Feb 17, 2022
(cherry picked from commit 6d110b5)
@stablum commented Mar 28, 2022

Hello,

Unfortunately, I'm still getting deadlocks: #19957 (comment)
