Fix race conditions in task callback invocations #10917

houqp · 2020-09-14T00:32:11Z

This race condition resulted in task success and failure callbacks being
called more than once. Here is the order of events that could lead to
this issue:

1. Task starts running within process 2.
2. (process 1) local_task_job checks the task return code, which is None.
3. (process 2) Task exits with a failure state; the task state is updated to failed in the DB.
4. (process 2) Task failure callback is invoked through the taskinstance.handle_failure method.
5. (process 1) local_task_job heartbeat notices the task state set to failed, mistakes it for an external state update, and also invokes the task failure callback.

To avoid this race condition, we need to make sure task callbacks are
only invoked within a single process.
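The interleaving above can be replayed deterministically in a single Python process. This is only a minimal sketch: `FakeTaskInstance`, `raw_task_exits_with_failure`, `heartbeat`, and `simulate` are hypothetical names made up for illustration, not Airflow APIs, and the in-memory `state` attribute stands in for the task row in the DB.

```python
from typing import Callable, List, Optional

RUNNING, FAILED = "running", "failed"


class FakeTaskInstance:
    """Hypothetical stand-in for the task state stored in the DB."""

    def __init__(self, on_failure: Callable[[], None]) -> None:
        self.state = RUNNING
        self.on_failure = on_failure


def raw_task_exits_with_failure(ti: FakeTaskInstance) -> None:
    """Process 2: record the failure, then invoke the failure callback."""
    ti.state = FAILED
    ti.on_failure()


def heartbeat(ti: FakeTaskInstance, return_code: Optional[int],
              invoke_callbacks: bool) -> None:
    """Process 1: heartbeat that acts on a return code it read earlier."""
    if return_code is None and ti.state == FAILED:
        # Looks like an external state change, although process 2 made it.
        if invoke_callbacks:
            ti.on_failure()


def simulate(invoke_callbacks_in_heartbeat: bool) -> int:
    """Replay the event order and return how often the callback fired."""
    calls: List[str] = []
    ti = FakeTaskInstance(on_failure=lambda: calls.append("on_failure"))
    return_code = None               # process 1 polls: task still running
    raw_task_exits_with_failure(ti)  # process 2 fails and calls back
    heartbeat(ti, return_code, invoke_callbacks_in_heartbeat)
    return len(calls)
```

With these assumptions, `simulate(True)` fires the callback from both sides, while `simulate(False)` leaves the invocation to one process only.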

houqp · 2020-09-14T00:38:45Z

There is one test that I forgot to update; I will update it in a couple of hours.

airflow/jobs/local_task_job.py

kaxil

LGTM.

kaxil · 2020-09-16T12:47:31Z

cc @mik-laj @ashb You might want to take a look at it

ashb

This sounds broadly good -- except there's one case not handled:

if the raw task process dies hard (segfault, OOM killed say) then the failure callback wouldn't be executed.

houqp · 2020-09-18T00:11:47Z

@ashb this is the tradeoff we will have to make here. A segfault or OOM kill could happen to local_scheduler_job as well, although the chance is much lower. Either way, we need to make sure callbacks are invoked from within only one process: either the scheduler job or raw_task, but not both. This is happening in production for us at a high frequency right now: duplicate failure callbacks are invoked multiple times a day, while I don't recall running into hard-die scenarios, so I think it's a reasonable tradeoff.

On top of this, the refactoring here only changes the behavior of callback invocations triggered by an external state change. It's very unlikely that the external state is updated right before a task hits an OOM or segfault. With or without this change, Airflow still invokes the failure callback from within raw_task when the state is not changed externally, so the problem you mentioned already exists in today's code base.

From a design point of view, it's better to invoke success and failure callbacks from the task monitor process, e.g. the local scheduler job, but that would require a much bigger refactoring. If that's what the community prefers, I can take a stab at it. The run task command would need to be aware of whether it has been invoked with or without an external task monitor, and change the callback invocation logic based on that.
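A rough sketch of what monitor-side invocation could look like, assuming a hypothetical `run_with_monitor` helper (not an Airflow API) that launches the task as a child process and decides which callback to run from the child's exit code:

```python
import subprocess
import sys
from typing import Callable, List


def run_with_monitor(
    argv: List[str],
    on_success: Callable[[], None],
    on_failure: Callable[[int], None],
) -> int:
    """Run the task as a child process; only this monitor invokes callbacks."""
    returncode = subprocess.run(argv).returncode
    if returncode == 0:
        on_success()
    else:
        # The monitor sees every non-zero exit, so the child never needs
        # to invoke the callback itself.
        on_failure(returncode)
    return returncode


if __name__ == "__main__":
    # Hypothetical child command that exits with status 3.
    child = [sys.executable, "-c", "import sys; sys.exit(3)"]
    run_with_monitor(
        child,
        on_success=lambda: print("success callback"),
        on_failure=lambda code: print(f"failure callback, exit code {code}"),
    )
```

Because the callback decision lives in the parent, it would also cover the hard-death case: on POSIX, `subprocess` reports a child killed by a signal as a negative return code, which takes the failure branch here.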

houqp · 2020-09-24T05:08:03Z

Changing the title back to WIP since I am going to do a big refactor to move all success/failure callback invocations into callers of _run_raw_task.

github-actions · 2020-10-23T19:55:50Z

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks$,^Build docs$,^Spell check docs$,^Backport packages$,^Checks: Helm tests$,^Test OpenAPI*.

kaxil · 2020-12-23T17:45:05Z

Some tests were failing with a 137 error; I just restarted them.

houqp · 2020-12-24T05:51:40Z

Looks like another 2 tests failed with a 137 error; let me try restarting them again.

houqp · 2021-01-08T05:26:56Z

@ashb all tests are passing now, and we have been running this patch in production for a couple of weeks. Do you want to do another round of review?

ashb · 2021-01-08T07:56:47Z

I'll take a look. We've been running with this at Astronomer too and haven't had any problems reported either

kaxil · 2021-01-18T15:54:42Z

@houqp Can you please rebase on Master one last time :) -- Thanks

houqp · 2021-01-18T23:24:56Z

@kaxil rebased :)

Fix race conditions in task callback invocations (cherry picked from commit f1d4f54)

boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Sep 14, 2020

houqp requested review from ashb, potiuk, turbaszek, mik-laj and kaxil September 14, 2020 00:32

turbaszek reviewed Sep 14, 2020

turbaszek approved these changes Sep 14, 2020

houqp force-pushed the qp/race_condition branch 6 times, most recently from defd79e to 4f7a0b1 on September 15, 2020 06:00

kaxil approved these changes Sep 16, 2020

kaxil changed the title ~~fix race conditions in task callback invocations~~ Fix race conditions in task callback invocations Sep 16, 2020

ashb reviewed Sep 17, 2020

turbaszek mentioned this pull request Sep 22, 2020

on_failure_callback not called when task receives termination signal #11086

Closed

houqp changed the title ~~Fix race conditions in task callback invocations~~ WIP: Fix race conditions in task callback invocations Sep 24, 2020

houqp force-pushed the qp/race_condition branch 4 times, most recently from 3d30e51 to 3891d8f on October 23, 2020 19:36

houqp force-pushed the qp/race_condition branch 2 times, most recently from 498c893 to 7eac4e1 on October 24, 2020 06:38

houqp force-pushed the qp/race_condition branch from a63398d to 499e5c2 on December 24, 2020 05:52

houqp force-pushed the qp/race_condition branch from 499e5c2 to 29170fc on January 6, 2021 20:05

ashb approved these changes Jan 18, 2021

Qingping Hou added 11 commits January 18, 2021 12:18

fix test for success callback

326367d

expand try catch block scope in _run_raw_task

c8c81ea

make pylint happy

6357933

move finished callback invocation into task execution caller

191632d

add more type annotation

e52dba1

pass task run error through error file from ti.handle_task_failure

38c823e

address review, centrailize error loading into load_error_file

230b842

address review feedback

874189b

fix hive tests

a83fa8f

address code review, remove magic number

b2923d6

houqp force-pushed the qp/race_condition branch from 29170fc to b2923d6 on January 18, 2021 20:26

kaxil merged commit f1d4f54 into apache:master Jan 18, 2021

potiuk added a commit to PolideaInternal/airflow that referenced this pull request Jan 19, 2021

fixup! Fix race conditions in task callback invocations (apache#10917)

93712ba

houqp deleted the qp/race_condition branch January 19, 2021 18:30

kaxil mentioned this pull request Apr 12, 2021

Execute on_failure_callback when SIGTERM is received #15172

Merged

houqp mentioned this pull request Apr 16, 2021

on_failure_callback does not seem to fire on pod deletion/eviction #14422

Closed

potiuk mentioned this pull request Jan 4, 2022

callback functions not called when a dag run is marked success or failure #18113

Closed
