Avoid deadlock when rescheduling task #21362
@ashb @ephraimbuddy @uranusjr @BasPH - I think I have finally found, thanks to the "server log" provided by our user, the reason for the deadlock that has been plaguing some users. (The nature of the scenario that causes it also explains why it is so rare and why only some users experience it.) Please take a look. The gist with the log that led me to this hypothesis/solution is here: #16982 (comment)
Force-pushed from 3b43c87 to f9e5926.
Hey @jedcunningham - if this one is confirmed to fix the deadlock issue, I think it is a very good candidate for 2.2.4. It's very small and IMHO not at all risky (the worst that can happen is slightly slower rescheduling when AirflowRescheduleException is thrown), and it solves a really nasty edge case that cannot be worked around otherwise (if my hypothesis is confirmed, that is).
How I got to that. Gist here: https://gist.github.com/tulanowski/fcc8358bad3c8e5d15678639ec015d8b
Query 1 (just a fragment of it): this is the scheduler trying to get the task instances to consider for scheduling:
This is this query (TRANSACTION 1 from the "server log"):
Query 2: this is the TaskReschedule insert, which only happens (as far as I checked) when AirflowRescheduleException is thrown during task execution (TRANSACTION 2 from the server log):
Both of them are waiting for this lock:
How I understand this: this is a lock on the index of the dag_run primary key, which needs to be updated because we are inserting a row in TaskReschedule. Because of the "DagRun" relationship in the TaskReschedule object, it needs to be locked whenever a TaskReschedule related to the same dag_run_id needs to be updated. So what I think happens:
Classic deadlock.
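The hypothesis above comes down to two transactions taking the same locks in opposite order. A minimal sketch of that check follows. The lock names are hypothetical and chosen only for illustration; in particular, treating the task instance row as the second contended resource is an assumption, since the surviving text does not name it.

```python
from itertools import combinations


def opposite_order_pairs(order_a, order_b):
    """Return pairs of locks that two transactions take in opposite order.

    Any such pair means the two transactions can deadlock if they interleave.
    """
    pos_a = {name: i for i, name in enumerate(order_a)}
    pos_b = {name: i for i, name in enumerate(order_b)}
    shared = [name for name in order_a if name in pos_b]
    return [
        (x, y)
        for x, y in combinations(shared, 2)
        if (pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y])
    ]


# Hypothetical lock acquisition orders (assumptions, not taken from the server log).
scheduler = ["dag_run", "task_instance"]
task_before_fix = ["task_instance", "dag_run"]
task_after_fix = ["dag_run", "task_instance"]

print(opposite_order_pairs(scheduler, task_before_fix))  # [('dag_run', 'task_instance')]
print(opposite_order_pairs(scheduler, task_after_fix))   # []
```

An empty result means no pair of locks is taken in opposite order, so this particular cycle cannot form.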
Force-pushed from f9e5926 to c468a5e.
The scheduler job performs scheduling after locking the "scheduled" DagRun row for writing. This should prevent the DagRun and its related task instances from being modified by another scheduler or by a "mini-scheduler" run after a task completes.

However, there is apparently one more case where the DagRun is locked by "Task" processes: when a task throws AirflowRescheduleException. In this case a new "TaskReschedule" entity is inserted into the database, and that insert also takes a lock on the DagRun (because TaskReschedule has a "DagRun" relationship).

This PR modifies the handling of AirflowRescheduleException to obtain the very same DagRun lock before it attempts to insert the TaskReschedule entity.

It seems that TaskReschedule is the only entity with this relationship, so likely all the mysterious SchedulerJob deadlock cases we experienced might be explained (and fixed) by this one.

It is likely that this one:

* Fixes: apache#16982
* Fixes: apache#19957
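The idea of the fix, taking the same DagRun lock first on both sides, can be sketched with plain thread locks standing in for database locks. This is not Airflow code: the function names are hypothetical, and the second lock is an assumed stand-in for whatever else both sides contend on.

```python
import threading

# Stand-ins for database locks; both names are illustrative assumptions.
dag_run_lock = threading.Lock()        # the DagRun row lock the scheduler takes
task_instance_lock = threading.Lock()  # hypothetical second lock both sides need


def scheduler_pass(log):
    # The scheduler locks the DagRun first, then looks at its task instances.
    with dag_run_lock:
        with task_instance_lock:
            log.append("scheduler")


def handle_reschedule(log):
    # The fix: take the very same DagRun lock before doing anything else,
    # so the order matches the scheduler and no cycle can form.
    with dag_run_lock:
        with task_instance_lock:
            log.append("task_reschedule_insert")


def run_concurrently(rounds=200):
    log = []
    for _ in range(rounds):
        workers = [
            threading.Thread(target=scheduler_pass, args=(log,)),
            threading.Thread(target=handle_reschedule, args=(log,)),
        ]
        for w in workers:
            w.start()
        for w in workers:
            w.join(timeout=5)
        if any(w.is_alive() for w in workers):
            raise RuntimeError("workers are stuck: possible deadlock")
    return log


log = run_concurrently()
print(len(log))  # 400
```

The cost matches what is noted earlier in the thread: the rescheduling path may wait slightly longer, because it now queues behind whoever holds the DagRun lock.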
Force-pushed from c468a5e to ca5d372.
You most probably did not mean me. Ephraim... T.Ephraim perhaps or so. Please double check who you tag!
Yep, sorry. It happens. I meant @ephraimbuddy. Feel free to mute this discussion.
@ashb @ephraimbuddy @uranusjr @jedcunningham - a close look at this would be useful. It would be good to get this one in before 2.2.4 (if we have a consensus that it looks like a plausible explanation + fix).
Then please don't add me again! By still using @ephraim!
Ah sorry. Really. Copy&paste. Really sorry!
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
Same commit message as above (Fixes: #16982, Fixes: #19957), cherry picked from commit 6d110b5.
Hello, unfortunately I'm still getting deadlocks: #19957 (comment)