LocalTaskJob heartbeat race condition with finishing task causing SIGTERM #16227
Hi everyone! It looks like we are facing a related issue. Here are my shortened debug logs:
At this point the process (PID 468) should have finished, but as the logs below show, it continued to send heartbeats and killed the next poke process:
I found that this bug began affecting our sensors after commit 817b599 fixed the incorrect validation of the recorded PID.
Additionally, I found that the first process logged the `Refreshing TaskInstance` message from `TaskInstance.refresh_from_db`, but never logged the `Refreshed TaskInstance` message.
Another report of the same issue:
For me, it also happens when the status is `failed` and with the mini scheduler turned off (though less often).
I'm able to replicate this consistently with this DAG:
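(The original DAG is not reproduced here. As a rough stand-in, here is a minimal hypothetical DAG in the spirit of the reports in this thread, built around one long-running task; all names in it are illustrative, not from the original comment.)

```python
# Hypothetical reproduction DAG (not the one from the comment above).
# A long-running task widens the window between "task finished" and
# "process exited", which is where the heartbeat race can strike.
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def long_running_task():
    # Long enough for several LocalTaskJob heartbeats to fire.
    time.sleep(600)


with DAG(
    dag_id="heartbeat_race_repro",  # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="long_running",
        python_callable=long_running_task,
    )
```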
@Prasnal @millin I have been able to reproduce this only when I have dagrun_timeout set and the dag_run times out. Let me know if you also set dagrun_timeout in your dags. My case is caused by these lines: `airflow/jobs/scheduler_job.py`, lines 1730 to 1739 at commit 5e09926.
When the `skipped` state is set, the local task job sees this as an externally set state and terminates the task runner.
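(For readers without the permalink handy, the logic at those lines is roughly the following; this is a paraphrased, simplified sketch, not the verbatim source.)

```python
from airflow.utils import timezone
from airflow.utils.state import State


def check_dagrun_timeout(dag, dag_run, session):
    # Paraphrased sketch of the dagrun_timeout handling at
    # airflow/jobs/scheduler_job.py lines 1730-1739 (commit 5e09926);
    # simplified for illustration, not the verbatim source.
    if dag.dagrun_timeout and dag_run.start_date < timezone.utcnow() - dag.dagrun_timeout:
        # The run has exceeded dagrun_timeout: fail it ...
        dag_run.set_state(State.FAILED)
        # ... and move its unfinished task instances to SKIPPED.
        for ti in dag_run.get_task_instances(state=State.unfinished, session=session):
            ti.state = State.SKIPPED
            session.merge(ti)
        # A LocalTaskJob still running one of those tasks then sees SKIPPED
        # as an externally set state in its heartbeat callback and sends
        # SIGTERM to the task runner.
```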
@ephraimbuddy I have no dagrun_timeout
I don't have it either.
Workaround for now would be to turn off the mini scheduler.
@kaxil In my case it just reduced the amount of errors.
Whoops, in that case it needs more investigation. I thought I read somewhere in this issue that it was caused by mini-scheduling.
I see the issue has been closed, but I am still experiencing it.
The fix that resolved this issue has not been released yet. |
Ahh, okay. I see it has been merged and will be deployed in 2.1.3.
The workaround is not working for me; I will just wait for the fix.
Yeah. The fix for this is not really the setting, as it turned out. It is fixed in #16301 along with other issues and will be released in 2.1.3.
I have switched to the latest image that has those changes, but I am still seeing the error: `[2021-08-03 09:45:44,395] {local_task_job.py:187} WARNING - State of this instance has been externally set to removed. Terminating instance.`
Can I see your dag? |
It's private code, but the gist of the logic: my DAG gets records that have not been published and creates a dynamic task to publish them. The publish task (which can take up to 30 minutes to run) is the one that keeps failing.
I was using this image: ghcr.io/apache/airflow-main-python3.9-v2:ff75cbcac9ec0b1992b4fddd6c160901f23e0c2a
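(To make the shape of that workload concrete, here is a minimal hypothetical sketch of the dynamic-publish pattern described above; the real DAG is private, so every name here is made up.)

```python
# Hypothetical sketch of the dynamic "publish" fan-out described above;
# the real DAG is private, so every name here is made up.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def get_unpublished_records():
    # Stand-in for the private lookup of not-yet-published records.
    return ["record_a", "record_b"]


def publish(record_id):
    # Stand-in for the real publish step, which reportedly runs
    # for up to 30 minutes.
    print(f"publishing {record_id}")


with DAG(
    dag_id="dynamic_publish",  # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    for record_id in get_unpublished_records():
        PythonOperator(
            task_id=f"publish_{record_id}",
            python_callable=publish,
            op_args=[record_id],
        )
```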
Yeah, checking the image, it has the change. In your DAG, how does the task state change to `removed`?
I will also simulate this: |
Not sure how the task changed to `removed`. I have added the config `AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME: 3600` so that the task will wait until it's finished before being terminated.
You can check the scheduler log when this happens; probably the task went missing and was removed somewhere. That happened externally, and I think it might not be related to this heartbeat issue.
Yes, from the scheduler logs I was seeing messages saying it cannot find the task. Why would the task go missing while it's still processing?
You can create an issue with a simple DAG that reproduces it. That way we will figure it out.
Hey @ephraimbuddy, we are facing this issue with Airflow version 2.2.3 for every single DAG. Error screenshot as below:
@vinit-tribes You should have pasted the full logs.
Thanks @ephraimbuddy for the response. No, we aren't using that. Pasting the full error log below:
Also @ephraimbuddy, I have created a new bug report as well at #20992.
Apache Airflow version: 2.0.2
Environment:
Kernel (e.g. `uname -a`): Linux datadumpprod2 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
What happened:
After task execution is done but the process isn't finished yet, the heartbeat callback kills the process because it falsely detects an external change of state.
This happens more often when the mini scheduler is enabled, because in that case the window for the race condition is bigger (by the execution time of the mini scheduler).
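(To make the race concrete, here is a simplified sketch of the check involved; it is a paraphrase of `LocalTaskJob.heartbeat_callback`, not the actual Airflow source, and the `job` parameter stands in for the `LocalTaskJob` instance.)

```python
from airflow.utils.state import State


def heartbeat_callback(job):
    # Simplified paraphrase of LocalTaskJob.heartbeat_callback, NOT the
    # actual Airflow source.
    ti = job.task_instance
    ti.refresh_from_db()  # logs "Refreshing TaskInstance", then "Refreshed TaskInstance"
    if ti.state != State.RUNNING:
        # Race: the task may have *just* finished on its own (or the mini
        # scheduler may still be running in this process), but the callback
        # cannot distinguish that from a genuinely external state change.
        job.log.warning(
            "State of this instance has been externally set to %s. Terminating instance.",
            ti.state,
        )
        job.task_runner.terminate()  # SIGTERM to a process that is still finishing up
```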
What you expected to happen:
The heartbeat should allow the task to finish and shouldn't kill it.
How to reproduce it:
As it's a race condition, it happens randomly. To make it more frequent, you should have the mini scheduler enabled and a database big enough that execution of the mini scheduler takes as long as possible. You can also reduce the heartbeat interval to the minimum.
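(For example, on Airflow 2.x the relevant knobs should be the mini-scheduler switch and the job heartbeat interval; a sketch of such a configuration, with illustrative values rather than recommendations:)

```ini
# airflow.cfg -- illustrative values to widen the race window
[scheduler]
# Mini scheduler enabled (the default) widens the window between the task
# finishing and the process exiting.
schedule_after_task_execution = True
# Minimal heartbeat interval so the callback fires as often as possible.
job_heartbeat_sec = 1
```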