[Bug] Cosmos tasks failing to heartbeat and killed eventually as zombies #1324
Labels
area:execution
Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc
bug
Something isn't working
triage-needed
Items need to be reviewed / assigned to milestone
Astronomer Cosmos Version
1.7.1
dbt-core version
NA
Versions of dbt adapters
No response
LoadMode
AUTOMATIC
ExecutionMode
LOCAL
InvocationMode
SUBPROCESS
airflow version
NA
Operating System
NA
If you think it's a UI issue, what browsers are you seeing the problem on?
No response
Deployment
Amazon (AWS) MWAA
Deployment details
No response
What happened?
Users have reported in the #airflow-dbt channel of the Apache Airflow Slack workspace that tasks are failing to report a heartbeat and that the executor then treats them as zombie processes.
Slack conversation: https://apache-airflow.slack.com/archives/C059CC42E9W/p1731162253771519
Relevant log output
How to reproduce
We're still waiting for input from users to confirm which invocation mode is being used here. The initial internal guess is that it may be the SUBPROCESS invocation mode, and @tatiana pointed out that the likely cause could be the wait() call on the sub_process in astronomer-cosmos/cosmos/hooks/subprocess.py (line 92 in 8ec46d2). The wait() call blocks the current execution until the sub_process completes. Irrespective of whether this is the root cause of the issue, we should refactor this piece to use poll() instead of wait(), which would allow us to run it in a non-blocking way and also allow the task to heartbeat. The refactor could look something like this:
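As a rough illustration of the poll()-based idea, here is a minimal, standalone sketch. It is not the actual Cosmos hook code: the function name run_with_polling, the poll_interval parameter, and the on_tick callback are hypothetical names chosen for this example, and whether such a change resolves the heartbeat problem has not been confirmed.

```python
import subprocess
import tempfile
import time
from typing import Callable, List, Optional, Tuple


def run_with_polling(
    command: List[str],
    poll_interval: float = 1.0,
    on_tick: Optional[Callable[[], None]] = None,
) -> Tuple[int, str]:
    """Run a command and check for completion with poll() instead of wait().

    Hypothetical helper for illustration only; not part of Cosmos.
    """
    # Write the child's output to a temporary file rather than a pipe, so a
    # chatty child cannot fill the pipe buffer and stall between polls.
    with tempfile.TemporaryFile(mode="w+") as output:
        sub_process = subprocess.Popen(
            command, stdout=output, stderr=subprocess.STDOUT
        )
        # poll() returns None while the child is running and never blocks,
        # so control comes back to this loop on every iteration.
        while sub_process.poll() is None:
            if on_tick is not None:
                on_tick()  # assumed hook, e.g. for logging or liveness checks
            time.sleep(poll_interval)
        # The child has exited; read back everything it wrote.
        output.seek(0)
        return sub_process.returncode, output.read()
```

Calling run_with_polling(["dbt", "run"], poll_interval=5.0) would return the exit code and the combined stdout/stderr once the command finishes, while the calling thread regains control every few seconds instead of sitting inside a single blocking wait().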
Anything else :)?
No response
Are you willing to submit PR?
Contact Details
No response