State of this instance has been externally set to up_for_retry. Terminating instance. #16573
Does it happen to all tasks or a specific task/dag?
@eladkal not sure if it is the same issue, but I am seeing the same symptom on 2.1.1, and can reproduce it easily by letting my DAGs run for an hour or two. This affects tasks randomly across my DAGs, usually after they have been running for a few minutes. I have a single Celery worker pod, which has been healthy whenever this has occurred so far. The Airflow audit log table has no indication that the task state was changed. Here's a sample from a task log, where I have activated SQLAlchemy query logging.
I have reviewed the logs for the scheduler, web server and worker, with SQLAlchemy query logging enabled as well, to try and determine where the state is being altered, and found nothing: none of them show the offending state change. At this stage, I am rather puzzled as to what is going on here. Are there more logs/checks I could enable to understand what's going on?
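As an aside, for anyone wanting to reproduce the same diagnostics: a minimal sketch of enabling SQLAlchemy query logging (a generic approach, not necessarily the exact setup used above) is to raise the `sqlalchemy.engine` logger to INFO, which makes every SQL statement emitted by the process appear in its log.

```python
import logging

# Generic approach (assumption -- not necessarily the commenter's exact setup):
# SQLAlchemy emits every SQL statement on the "sqlalchemy.engine" logger at INFO.
sa_logger = logging.getLogger("sqlalchemy.engine")
sa_logger.setLevel(logging.INFO)
sa_logger.addHandler(logging.StreamHandler())
```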
After more investigation, I was able to get to the bottom of it. I had missed the actual SQLAlchemy queries updating the task instance / task fail tables, as they are not performed by the scheduler, web server or worker, but by the DAG processor (as part of file-level callbacks), whose logs are only available locally in my setup. In my case, the scheduler thinks the job is a zombie; I had also initially missed the log message pointing to that.
The scheduler is marking the job as failed. This is apparently caused by a combination of two facts.
In this example, at 2021-07-27 09:54:36 UTC the following query was issued:

```sql
UPDATE job SET state='failed'
WHERE job.state = 'running'
  AND job.latest_heartbeat < '2021-07-27T09:54:06.555688+00:00'::timestamptz
```

From my limited understanding, there seem to be two issues at play here.
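To make the mechanism concrete, here is a minimal illustrative sketch (not Airflow's actual implementation) of the kind of heartbeat-threshold check that produces an UPDATE like the one quoted above: any "running" job whose latest heartbeat is older than an assumed threshold gets marked as failed.

```python
# Illustrative sketch only -- a simplified stand-in for the scheduler-side check,
# NOT Airflow's actual code. Assumes a SQLAlchemy `session` and a `Job` model
# with `state` and `latest_heartbeat` columns.
from datetime import datetime, timedelta, timezone

# Assumed threshold; it matches the 30 s gap visible in the query above.
HEARTBEAT_THRESHOLD = timedelta(seconds=30)

def fail_stale_jobs(session, Job):
    cutoff = datetime.now(timezone.utc) - HEARTBEAT_THRESHOLD
    # Equivalent in spirit to:
    #   UPDATE job SET state='failed'
    #   WHERE job.state = 'running' AND job.latest_heartbeat < :cutoff
    session.query(Job).filter(
        Job.state == "running",
        Job.latest_heartbeat < cutoff,
    ).update({Job.state: "failed"}, synchronize_session=False)
    session.commit()
```

A heartbeat that arrives even slightly late (for whatever reason) therefore gets the whole job marked as failed, which then surfaces on the task as "State of this instance has been externally set".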
Thanks for your detailed explanation @peay, this makes a lot of sense. I think setting a more forgiving heartbeat threshold would help here. In my case, I was connecting to the database through a load balancer, and the load balancer sometimes runs out of ports and drops connections, causing the DB connections from Airflow to fail. Some of these connection failures just happened to be the heartbeats from these tasks. Since the DB connection failures have been fixed, I'm not seeing this error in my setup either. Thanks once again.
I just experienced this as well and was able to use @peay's answer. Perhaps there should be some sort of warning or check to make sure the heartbeat-related settings are consistent with each other.
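No such check exists as far as I know; purely as an illustrative sketch of the idea (assuming the relevant options are `job_heartbeat_sec` and `scheduler_zombie_task_threshold` under `[scheduler]`), something along these lines could run at scheduler startup:

```python
# Hypothetical sanity check (NOT an existing Airflow feature): warn when the job
# heartbeat interval is not comfortably below the zombie-detection threshold.
import logging

from airflow.configuration import conf

log = logging.getLogger(__name__)

def warn_if_heartbeat_settings_look_risky():
    heartbeat_sec = conf.getfloat("scheduler", "job_heartbeat_sec")
    zombie_threshold = conf.getfloat("scheduler", "scheduler_zombie_task_threshold")
    if heartbeat_sec * 2 >= zombie_threshold:
        log.warning(
            "job_heartbeat_sec (%s) is close to scheduler_zombie_task_threshold (%s); "
            "a single missed heartbeat may get tasks marked as zombies.",
            heartbeat_sec,
            zombie_threshold,
        )
```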
I haven't reviewed the discussion here yet, but if you have a fix (documentation / code change), we welcome PRs.
Just found a failed task with exactly this message. Airflow 2.1.3.
Second attempt log (full):
Apache Airflow version: 2.0.1
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.18.14
Environment:
Cloud provider or hardware configuration: Azure
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others:
What happened:
An occasional Airflow task fails with the following error:
There is no indication as to what caused this error. The worker instance is healthy and the task did not hit the task timeout.
What you expected to happen:
The task to complete successfully. If a task had to fail for an unavoidable reason (like a timeout), it would be helpful to provide the reason for the failure.
How to reproduce it:
I'm not able to reproduce it consistently. It happens every now and then with the same error as provided above.
I also wish to know how to debug these failures.