Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Execute on_failure_callback when SIGTERM is received #15172

Merged
merged 11 commits into from
Apr 23, 2021

Conversation

ephraimbuddy
Copy link
Contributor

@ephraimbuddy ephraimbuddy commented Apr 3, 2021

Closes: #14422

Currently, on_failure_callback is only called when a task finishes
executing not while executing. When a pod is deleted, a SIGTERM is
sent to the task and the task is stopped immediately. The task is
still running when it was killed and therefore on_failure_callback
is not called.

This PR makes sure that when a pod is marked for deletion and the
task is killed, if the task has on_failure_callback, the callback
is called


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@ephraimbuddy ephraimbuddy force-pushed the call_on_failure_on_signal_receipt branch from 0b8eef8 to d1658e7 Compare April 3, 2021 08:06
@ephraimbuddy ephraimbuddy marked this pull request as draft April 3, 2021 10:00
@ephraimbuddy ephraimbuddy force-pushed the call_on_failure_on_signal_receipt branch from d1658e7 to ec2f0c7 Compare April 4, 2021 16:57
@github-actions
Copy link

github-actions bot commented Apr 4, 2021

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@ephraimbuddy ephraimbuddy force-pushed the call_on_failure_on_signal_receipt branch from f7121d9 to 850e583 Compare April 5, 2021 09:22
Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a correct fix check my comments in #14422 (comment)

@ephraimbuddy ephraimbuddy force-pushed the call_on_failure_on_signal_receipt branch from 850e583 to 02953ed Compare April 8, 2021 18:44
@ephraimbuddy ephraimbuddy marked this pull request as ready for review April 8, 2021 18:46
@ephraimbuddy
Copy link
Contributor Author

I don't think this is a correct fix check my comments in #14422 (comment)

I have updated it with a test

@ephraimbuddy ephraimbuddy requested a review from kaxil April 8, 2021 18:48
@ephraimbuddy ephraimbuddy force-pushed the call_on_failure_on_signal_receipt branch from 02953ed to a6c3e34 Compare April 8, 2021 22:11
@github-actions
Copy link

github-actions bot commented Apr 8, 2021

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@ephraimbuddy ephraimbuddy force-pushed the call_on_failure_on_signal_receipt branch from d472173 to ceb390b Compare April 9, 2021 11:26
@ephraimbuddy ephraimbuddy reopened this Apr 9, 2021
Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sent you an invite to talk about this on Monday

@ephraimbuddy
Copy link
Contributor Author

Sent you an invite to talk about this on Monday

Accepted. Thanks

@kaxil kaxil requested a review from houqp April 12, 2021 16:38
@kaxil
Copy link
Member

kaxil commented Apr 12, 2021

@houqp Can you take a look at this please since you worked on #10917 coz this can probably lead to the same race condition as the one you fixed

@kaxil
Copy link
Member

kaxil commented Apr 14, 2021

@houqp Did you get a chance to take a look

@houqp
Copy link
Member

houqp commented Apr 16, 2021

@kaxil @ephraimbuddy sorry I was in vacation. I commented my analysis and recommended fix in #14422 (comment). @ephraimbuddy could you give that a try?

This fix will not cause race condition as far as I can tell, because it is still only executing the callback in a one process (success in local task job and failure in run_raw_task). but it will cause a regression for #11086.

@houqp
Copy link
Member

houqp commented Apr 16, 2021

@ephraimbuddy to simulate the scenario in #11086, we could send a sigkill (9) instead of sigterm (15) in the unit test to force kill the task_runner subprocess.

break
time.sleep(0.2)
assert ti.state == State.RUNNING
os.kill(ti.pid, 15)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good unit test, thanks for adding it 👍

@github-actions
Copy link

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@github-actions
Copy link

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming suggestions from @kaxil are incorporated.

@ephraimbuddy ephraimbuddy force-pushed the call_on_failure_on_signal_receipt branch from 1ca85c1 to 67224d8 Compare April 21, 2021 06:39
@github-actions
Copy link

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

@ephraimbuddy ephraimbuddy force-pushed the call_on_failure_on_signal_receipt branch from 0a37e45 to c4e218f Compare April 23, 2021 15:59
ephraimbuddy and others added 11 commits April 23, 2021 18:13
Currently, on_failure_callback is only called when a task finishes
executing not while executing. When a pod is deleted, a SIGTERM is
sent to the task and the task is stopped immediately. The task is
still running when it was killed and therefore on_failure_callback
is not called.

This PR makes sure that when a pod is marked for deletion and the
task is killed, if the task has on_failure_callback, the callback
is called
@ephraimbuddy ephraimbuddy force-pushed the call_on_failure_on_signal_receipt branch from c4e218f to ed39194 Compare April 23, 2021 17:13
@kaxil
Copy link
Member

kaxil commented Apr 23, 2021

Failing Helm chart test is already fixed in Master by 17c38be

@kaxil kaxil merged commit def1e7c into apache:master Apr 23, 2021
@kaxil kaxil deleted the call_on_failure_on_signal_receipt branch April 23, 2021 22:47
potiuk pushed a commit that referenced this pull request May 9, 2021
Currently, on_failure_callback is only called when a task finishes
executing not while executing. When a pod is deleted, a SIGTERM is
sent to the task and the task is stopped immediately. The task is
still running when it was killed and therefore on_failure_callback
is not called.

This PR makes sure that when a pod is marked for deletion and the
task is killed, if the task has on_failure_callback, the callback
is called.

Closes: #14422

(cherry picked from commit def1e7c)
@ashb ashb modified the milestones: Airflow 2.0.3, Airflow 2.1 May 18, 2021
wyndhblb added a commit to asappinc/incubator-airflow that referenced this pull request Jul 18, 2021
i know it’s not “exact” but setting tasks to FAILED when there are more reties is soooo “upsetting” to a lot of tasks and dags that it’s not worth it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
full tests needed We need to run full set of tests for this PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

on_failure_callback does not seem to fire on pod deletion/eviction
4 participants