Execute on_failure_callback when SIGTERM is received #15172
0b8eef8 to d1658e7
d1658e7 to ec2f0c7
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.
f7121d9 to 850e583
I don't think this is a correct fix. Check my comments in #14422 (comment).
850e583 to 02953ed
I have updated it with a test.
02953ed to a6c3e34
d472173 to ceb390b
Sent you an invite to talk about this on Monday
Accepted. Thanks
@houqp Did you get a chance to take a look?
@kaxil @ephraimbuddy sorry, I was on vacation. I posted my analysis and recommended fix in #14422 (comment). @ephraimbuddy could you give that a try? As far as I can tell, this fix will not cause a race condition, because the callback is still executed in only one process (success in the local task job and failure in run_raw_task), but it will cause a regression for #11086.
@ephraimbuddy to simulate the scenario in #11086, we could send SIGKILL (9) instead of SIGTERM (15) in the unit test to force-kill the task_runner subprocess.
tests/jobs/test_local_task_job.py (outdated)
            break
        time.sleep(0.2)
    assert ti.state == State.RUNNING
    os.kill(ti.pid, 15)
good unit test, thanks for adding it 👍
ceb390b to 3def8f9
LGTM assuming suggestions from @kaxil are incorporated.
1ca85c1 to 67224d8
The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.
0a37e45 to c4e218f
Currently, on_failure_callback is only called when a task finishes executing, not while it is executing. When a pod is deleted, a SIGTERM is sent to the task and the task is stopped immediately. The task was still running when it was killed, and therefore on_failure_callback is not called. This PR makes sure that when a pod is marked for deletion and the task is killed, the task's on_failure_callback, if it has one, is called.
Co-authored-by: Kaxil Naik
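The behaviour described in this commit message can be sketched in plain Python. This is a hedged illustration of the general technique (turning SIGTERM into an exception so the failure path runs), not the actual Airflow implementation: `TaskTerminated`, `run_task`, and `slow_task` are hypothetical names, and the context passed to the callback is simplified to a dict with one key.

```python
import os
import signal
import time


class TaskTerminated(Exception):
    """Raised inside the running task when SIGTERM arrives (hypothetical name)."""


def run_task(execute, on_failure_callback=None):
    """Run `execute`; if it fails or is terminated, invoke the failure callback."""

    def _on_sigterm(signum, frame):
        # Convert the signal into an exception so the normal failure path runs.
        raise TaskTerminated(f"received signal {signum}")

    previous = signal.signal(signal.SIGTERM, _on_sigterm)
    try:
        execute()
        return "success"
    except Exception as exc:
        if on_failure_callback is not None:
            on_failure_callback({"exception": exc})
        return "failed"
    finally:
        # Restore whatever handler was installed before the task started.
        signal.signal(signal.SIGTERM, previous)


def slow_task():
    # Simulate the pod being deleted while the task is still running.
    os.kill(os.getpid(), signal.SIGTERM)
    time.sleep(5)


seen = []  # collects the contexts passed to the callback
state = run_task(slow_task, seen.append)
print(state, seen)
```

The sketch must run in the main thread, since Python only allows signal handlers to be installed there.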
c4e218f to ed39194
The failing Helm chart test is already fixed in master by 17c38be.
Currently, on_failure_callback is only called when a task finishes executing, not while it is executing. When a pod is deleted, a SIGTERM is sent to the task and the task is stopped immediately. The task was still running when it was killed, and therefore on_failure_callback is not called. This PR makes sure that when a pod is marked for deletion and the task is killed, the task's on_failure_callback, if it has one, is called. Closes: #14422 (cherry picked from commit def1e7c)
I know it's not “exact”, but setting tasks to FAILED when there are more retries left is so “upsetting” to a lot of tasks and DAGs that it's not worth it.
Closes: #14422