Fix on_failure_callback when task receive SIGKILL #15537
Can |
Have parameterized them except |
This PR fixes a case where a task's on_failure_callback was not called when the task was killed for running out of memory (OOM). The task PID was being set in the wrong place, and the local task job heartbeat was not checking the correct PIDs of the process runner and the task. The task PID is now set in the _run_raw_task method instead of in check_and_change_state_before_execution.
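A process that receives SIGKILL cannot run any cleanup code itself, so a failure callback has to be fired by whatever process supervises it. The following is a minimal, POSIX-only sketch of that idea, not the code from this PR: `run_with_failure_callback` and the `on_failure` signature are hypothetical names chosen for illustration, and a plain `subprocess.Popen` child stands in for the task process.

```python
import signal
import subprocess
import sys


def run_with_failure_callback(cmd, on_failure):
    """Run cmd as a child process and call on_failure if it ends abnormally.

    Hypothetical helper for illustration only; it is not an Airflow API.
    """
    proc = subprocess.Popen(cmd)
    return_code = proc.wait()

    if return_code != 0:
        if return_code < 0:
            # On POSIX, a negative return code -N means the child was
            # terminated by signal N (for example -9 for SIGKILL).
            reason = f"killed by {signal.Signals(-return_code).name}"
        else:
            reason = f"exited with code {return_code}"
        # The child is already gone, so the supervisor runs the callback.
        on_failure(proc.pid, reason)

    return return_code


if __name__ == "__main__":
    # A child that sends SIGKILL to itself, standing in for an OOM kill.
    self_kill = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGKILL)",
    ]
    run_with_failure_callback(
        self_kill,
        lambda pid, reason: print(f"failure callback for pid {pid}: {reason}"),
    )
```

The point of the sketch is that the supervisor can only attribute the exit correctly if it is watching the right PID, which is why where the task PID gets recorded matters.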
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
This PR causes DAGs that use Hive operations to fail because they can't get the right PID. The process info for the Hive task looks like this:
The error log is as follows:
The Hive demo runs successfully once all of this PR's code is removed, so please consider improving the PR or reverting it.
Interesting. cc: @ashb
@ashb @jedcunningham Hello, could you please take a look? Thanks.
In that example I don't see PID 30960 anywhere. Do you know what process it was? I suspect this might be to do with impersonation, rather than anything in Hive in particular.
@ashb Sorry, I copied the wrong logs by mistake. I have updated the comment using a new Airflow task. I agree with your suspicion. In my environment, shell-type tasks run normally, while tasks that spawn a subprocess all fail.
Hi @huozhanfeng, I'm looking into fixing this, but it would help if you could give me reproduction steps using a Bash or Python operator.
Sure. I think you can reproduce it by mocking a Bash or Python operator that spawns a Java subprocess. In my environment, Python operators that use the apache-sqoop tool fail to run, and so do Hive operators.
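One way a PID comparison can go wrong, sketched under assumptions rather than taken from the reported logs: if the supervisor launches a wrapper process and the real work happens in a grandchild, the PID the work records for itself differs from the PID the supervisor launched. In this illustration the wrapper stands in for an impersonation layer and the inner process for the task; `TASK`, `WRAPPER`, and `launch_and_compare` are hypothetical names.

```python
import subprocess
import sys

# Innermost process: stands in for the task and reports its own PID.
TASK = "import os; print(os.getpid())"

# Wrapper process: stands in for an intermediate layer (for example an
# impersonation wrapper) that starts the task as a separate child.
WRAPPER = (
    "import subprocess, sys; "
    f"subprocess.run([sys.executable, '-c', {TASK!r}], check=True)"
)


def launch_and_compare():
    """Return (pid the supervisor launched, pid the task recorded)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WRAPPER],
        stdout=subprocess.PIPE,
        text=True,
    )
    # The task inherits the wrapper's stdout, so its output arrives here.
    out, _ = proc.communicate()
    recorded_pid = int(out.strip())
    return proc.pid, recorded_pid


if __name__ == "__main__":
    launched_pid, recorded_pid = launch_and_compare()
    print(f"supervisor launched pid {launched_pid}")
    print(f"task recorded pid {recorded_pid}")
    # A naive equality check would treat this healthy task as a mismatch.
    print("pids match:", launched_pid == recorded_pid)
```

A check that expects the two PIDs to be equal would flag this run as a mismatch even though nothing failed, which is consistent with the suspicion about impersonation raised above.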
Closes: #11086