Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race conditions in task callback invocations #10917

Merged
merged 11 commits into from
Jan 18, 2021

Conversation

houqp
Copy link
Member

@houqp houqp commented Sep 14, 2020

This race condition resulted in task success and failure callbacks being
called more than once. Here is the order of events that could lead to
this issue:

  • task started running within process 2
  • (process 1) local_task_job checked for task return code, returns None
  • (process 2) task exited with failure state, task state updated as failed in DB
  • (process 2) task failure callback invoked through taskinstance.handle_failure method
  • (process 1) local_task_job heartbeat noticed task state set to
    failure, mistoken it as state bing updated externally, also invoked task
    failure callback

To avoid this race condition, we need to make sure task callbacks are
only invoked within a single process.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Sep 14, 2020
@houqp
Copy link
Member Author

houqp commented Sep 14, 2020

there is one test that i forgot to update, will update it in couple hours.

@houqp houqp force-pushed the qp/race_condition branch 6 times, most recently from defd79e to 4f7a0b1 Compare September 15, 2020 06:00
Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@kaxil kaxil changed the title fix race conditions in task callback invocations Fix race conditions in task callback invocations Sep 16, 2020
@kaxil
Copy link
Member

kaxil commented Sep 16, 2020

cc @mik-laj @ashb You might want to take a look at it

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds broadly good -- except there's one case not handled:

if the raw task process dies hard (segfault, OOM killed say) then the failure callback wouldn't be executed.

@houqp
Copy link
Member Author

houqp commented Sep 18, 2020

@ashb this is the tradeoff we will have to make here, segfault, OOM kill could happen to local_scheduler_job as well, although the chance is much lower. Either way, we need to make sure callbacks are only invoked from within one process, it's either the scheduler job, or raw_task, but not from both. This is happening in production for us at a high frequency right now, duplicated failure callbacks are invoked multiple times a day, while I don't recall running into hard die scenarios, so I think it's a reasonable tradeoff.

On top of this, the refactoring here only changes behavior of callback invocation triggered by external state change. It's very unlikely that external state are updated right before task got into OOM or segfault. With or without this change, Airflow still invokes failure callback from within raw_task when state are not changed externally, so the problem you mentioned already exists in today's code base.

From a design's point of view, it's better to invoke success and failure callbacks from the task monitor process, e.g. the local scheduler job, but it would require a much bigger refactoring. If that's what the community prefers, I can give that a stab. The run task command needs to be aware of whether it's been invoked with or without an external task monitor, and change the callback invocation logic based on that.

@houqp
Copy link
Member Author

houqp commented Sep 24, 2020

changing title back to WIP since I am going to do a big refactor to move all success/callback invocations into callers of _run_raw_task.

@houqp houqp changed the title Fix race conditions in task callback invocations WIP: Fix race conditions in task callback invocations Sep 24, 2020
@houqp houqp force-pushed the qp/race_condition branch 4 times, most recently from 3d30e51 to 3891d8f Compare October 23, 2020 19:36
@github-actions
Copy link

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks$,^Build docs$,^Spell check docs$,^Backport packages$,^Checks: Helm tests$,^Test OpenAPI*.

@houqp houqp force-pushed the qp/race_condition branch 2 times, most recently from 498c893 to 7eac4e1 Compare October 24, 2020 06:38
@kaxil
Copy link
Member

kaxil commented Dec 23, 2020

Some tests were failing with 137 error, just restarted them.

@houqp
Copy link
Member Author

houqp commented Dec 24, 2020

looks like another 2 tests failed with 137 error, let me try to restart again

@houqp houqp force-pushed the qp/race_condition branch from a63398d to 499e5c2 Compare December 24, 2020 05:52
@houqp houqp force-pushed the qp/race_condition branch from 499e5c2 to 29170fc Compare January 6, 2021 20:05
@houqp
Copy link
Member Author

houqp commented Jan 8, 2021

@ashb all tests are passing now, we have been running this patch in production for couple weeks. do you want to do another round of review?

@ashb
Copy link
Member

ashb commented Jan 8, 2021

I'll take a look. We've been running with this at Astronomer too and haven't had any problems reported either

@kaxil
Copy link
Member

kaxil commented Jan 18, 2021

@houqp Can you please rebase on Master one last time :) -- Thanks

Qingping Hou added 11 commits January 18, 2021 12:18
This race condition resulted in task success and failure callbacks being
called more than once. Here is the order of events that could lead to
this issue:

* task started running within process 2
* (process 1) local_task_job checked for task return code, returns None
* (process 2) task exited with failure state, task state updated as failed in DB
* (process 2) task failure callback invoked through taskinstance.handle_failure method
* (process 1) local_task_job heartbeat noticed task state set to
  failure, mistoken it as state bing updated externally, also invoked task
  failure callback

To avoid this race condition, we need to make sure task callbacks are
only invoked within a single process.
@houqp houqp force-pushed the qp/race_condition branch from 29170fc to b2923d6 Compare January 18, 2021 20:26
@houqp
Copy link
Member Author

houqp commented Jan 18, 2021

@kaxil rebased :)

@kaxil kaxil merged commit f1d4f54 into apache:master Jan 18, 2021
potiuk added a commit to PolideaInternal/airflow that referenced this pull request Jan 19, 2021
@houqp houqp deleted the qp/race_condition branch January 19, 2021 18:30
kaxil pushed a commit that referenced this pull request Jan 21, 2021
This race condition resulted in task success and failure callbacks being
called more than once. Here is the order of events that could lead to
this issue:

* task started running within process 2
* (process 1) local_task_job checked for task return code, returns None
* (process 2) task exited with failure state, task state updated as failed in DB
* (process 2) task failure callback invoked through taskinstance.handle_failure method
* (process 1) local_task_job heartbeat noticed task state set to
  failure, mistoken it as state bing updated externally, also invoked task
  failure callback

To avoid this race condition, we need to make sure task callbacks are
only invoked within a single process.

(cherry picked from commit f1d4f54)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler full tests needed We need to run full set of tests for this PR to merge priority:critical Showstopper bug that should be patched immediately
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants