on_failure_callback not called when task receives termination signal #11086
Comments
Related to #10917?
Good find! I think this issue is related, but not quite the same. This particular case is explicitly mentioned in @houqp's response here, and it looks like the change that would rectify this (moving callback handling to
Yeah, those are two different issues. @madison-ookla what executor are you using? If you run into OOM, the raw task process will receive a SIGKILL instead of a SIGTERM, which cannot be captured and handled by the process itself. #7025 doesn't solve this problem because it only handles cases where the raw task process did not exit by itself or was killed by SIGKILL. I think to properly fix this bug, we will need to move the failure callback invocation into the caller of the raw task process, e.g. I will update my #10917 PR to cover this.
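To illustrate the point above, here is a minimal sketch using only the Python standard library, not Airflow code. The names `run_with_failure_callback` and `describe` are hypothetical. It assumes a POSIX system and shows why the callback has to live in the caller: a child killed by SIGKILL gets no chance to run any handler, but the parent still sees a negative return code and can invoke the callback itself.

```python
import signal
import subprocess
import sys


def run_with_failure_callback(cmd, on_failure):
    """Run cmd as a child process and call on_failure from the parent on failure.

    Hypothetical helper for illustration only; this is not Airflow's API.
    """
    proc = subprocess.Popen(cmd)
    returncode = proc.wait()
    # On POSIX, a negative return code -N means the child was killed by signal N.
    if returncode != 0:
        on_failure(returncode)
    return returncode


def describe(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        return signal.Signals(-returncode).name
    return f"exit code {returncode}"


if __name__ == "__main__":
    # The child sends SIGKILL to itself, standing in for the OOM killer.
    child = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGKILL)",
    ]
    run_with_failure_callback(
        child, lambda rc: print("failure callback:", describe(rc))
    )
```

Because the check happens after `wait()` in the parent, it does not matter whether the child exited on its own, raised, or was killed by a signal it could not trap.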
Hello, I have a similar problem. Any ETA on fixing this?
@molejnik-mergebit this should have already been fixed in 2.0.1 through #10917. What version are you on?
@houqp thanks for the heads up, I'm still on old 1.10.12. Will try to update and check.
@molejnik-mergebit actually I take it back, this bug is not fixed, but it's now fixable after #10917. I believe @ephraimbuddy is working on it.
Thanks for the update. I got it as well after updating to 2.0.1, but I didn't have steps to reproduce it.
Apache Airflow version: 1.10.7, 1.10.10, 1.10.12
Environment (`uname -a`): Debian
What happened:
For the last several versions of Airflow, we've noticed that when a task receives a `SIGTERM` signal (currently represented as `Task exited with return code Negsignal.SIGKILL`, though previously represented as `Task exited with return code -9`), the failure email would be sent, but the `on_failure_callback` would not be called.
This happened fairly frequently in the past for us as we had tasks that would consume high amounts of memory and occasionally we would have too many running on the same worker and the tasks would be OOM killed. In these instances, we would receive failure emails with the contents `detected as zombie` and the `on_failure_callback` would not be called. We were hoping #7025 would resolve this with the most recent upgrade (and we've also taken steps to reduce our memory footprint), but we just had this happen again recently.
What you expected to happen:
If a task fails (even if the cause of the failure is a lack of resources), I would hope the `on_failure_callback` would still be called.
How to reproduce it:
Example DAG setup:
CODE
With the above example, the `Failure_test` should trigger a run of the `OOM_test_follower` DAG when it fails. The `OOM_test` DAG when triggered should quickly run out of memory and then not trigger a run of the `OOM_test_follower` DAG.
Anything else we need to know: