-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race conditions in task callback invocations #10917
Conversation
there is one test that i forgot to update, will update it in couple hours. |
defd79e
to
4f7a0b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds broadly good -- except there's one case not handled:
if the raw task process dies hard (segfault, OOM killed say) then the failure callback wouldn't be executed.
@ashb this is the tradeoff we will have to make here, segfault, OOM kill could happen to local_scheduler_job as well, although the chance is much lower. Either way, we need to make sure callbacks are only invoked from within one process, it's either the scheduler job, or raw_task, but not from both. This is happening in production for us at a high frequency right now, duplicated failure callbacks are invoked multiple times a day, while I don't recall running into hard die scenarios, so I think it's a reasonable tradeoff. On top of this, the refactoring here only changes behavior of callback invocation triggered by external state change. It's very unlikely that external state are updated right before task got into OOM or segfault. With or without this change, Airflow still invokes failure callback from within raw_task when state are not changed externally, so the problem you mentioned already exists in today's code base. From a design's point of view, it's better to invoke success and failure callbacks from the task monitor process, e.g. the local scheduler job, but it would require a much bigger refactoring. If that's what the community prefers, I can give that a stab. The run task command needs to be aware of whether it's been invoked with or without an external task monitor, and change the callback invocation logic based on that. |
changing title back to WIP since I am going to do a big refactor to move all success/callback invocations into callers of _run_raw_task. |
3d30e51
to
3891d8f
Compare
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks$,^Build docs$,^Spell check docs$,^Backport packages$,^Checks: Helm tests$,^Test OpenAPI*. |
498c893
to
7eac4e1
Compare
Some tests were failing with 137 error, just restarted them. |
looks like another 2 tests failed with 137 error, let me try to restart again |
a63398d
to
499e5c2
Compare
499e5c2
to
29170fc
Compare
@ashb all tests are passing now, we have been running this patch in production for couple weeks. do you want to do another round of review? |
I'll take a look. We've been running with this at Astronomer too and haven't had any problems reported either |
@houqp Can you please rebase on Master one last time :) -- Thanks |
This race condition resulted in task success and failure callbacks being called more than once. Here is the order of events that could lead to this issue: * task started running within process 2 * (process 1) local_task_job checked for task return code, returns None * (process 2) task exited with failure state, task state updated as failed in DB * (process 2) task failure callback invoked through taskinstance.handle_failure method * (process 1) local_task_job heartbeat noticed task state set to failure, mistoken it as state bing updated externally, also invoked task failure callback To avoid this race condition, we need to make sure task callbacks are only invoked within a single process.
29170fc
to
b2923d6
Compare
@kaxil rebased :) |
This race condition resulted in task success and failure callbacks being called more than once. Here is the order of events that could lead to this issue: * task started running within process 2 * (process 1) local_task_job checked for task return code, returns None * (process 2) task exited with failure state, task state updated as failed in DB * (process 2) task failure callback invoked through taskinstance.handle_failure method * (process 1) local_task_job heartbeat noticed task state set to failure, mistoken it as state bing updated externally, also invoked task failure callback To avoid this race condition, we need to make sure task callbacks are only invoked within a single process. (cherry picked from commit f1d4f54)
This race condition resulted in task success and failure callbacks being
called more than once. Here is the order of events that could lead to
this issue:
failure, mistoken it as state bing updated externally, also invoked task
failure callback
To avoid this race condition, we need to make sure task callbacks are
only invoked within a single process.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.