Task is not retried when worker pod fails to start #16625
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
We're also impacted by this bug. We run hundreds of tasks every hour, and every day we end up with multiple task instances stuck in the queued state. We currently have to clear the state of all those queued tasks so that they are picked up again.
Apache Airflow version: 2.0.2
Kubernetes version:
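For illustration only, here is a minimal sketch of how such stuck-in-queued task instances could be reset through the Airflow 2.x ORM (a hypothetical cleanup, not something proposed in this thread; the one-hour threshold is an arbitrary assumption, and resetting the state to None simply lets the scheduler pick the task instance up again):

from datetime import timedelta

from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.session import create_session
from airflow.utils.state import State

# Arbitrary threshold for "stuck" (an assumption for this sketch).
STUCK_FOR = timedelta(hours=1)

with create_session() as session:
    cutoff = timezone.utcnow() - STUCK_FOR
    stuck = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.state == State.QUEUED,
            TaskInstance.queued_dttm < cutoff,
        )
        .all()
    )
    for ti in stuck:
        # Resetting the state lets the scheduler schedule the task instance
        # again; create_session() commits on successful exit.
        ti.state = State.NONE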
We noticed this issue with Airflow 2.1.2. A job went from queued to failed without a retry; looking at the code, I am not sure how to fix it. It is clear that in
The relevant logs (the DAG and task name are erased because they might contain sensitive information):
Kubernetes version (EKS):
PR #15929 fixed the issue where the task had to be cleared before it could be rerun, and that is a major issue, because otherwise the task stays stuck, as explained by @mmazek above.
@stijndehaes, can you check the log at
@ashb @jhtimmins, I'm now thinking that this change (#15929) should not be released yet. Though it keeps tasks from being stuck in the queued/up_for_retry state, it sets them to the failed state without checking whether they have retries left. I'm wondering if there's a better way to do it?
@ephraimbuddy I have found a way to consistently trigger the issue using the dag attached below.
The log of the scheduler is the following pattern repeated:
@ephraimbuddy So the metadata database sees the task as failed (as reported by the executor), so that's what shows up in the UI, but the scheduler still thinks it's queued, so it never attempts a retry? Am I understanding the behavior correctly? If so, making sure that #15929 retries the task if it has retries left will be key. Otherwise, will the behavior difference even be noticeable to the user?
I think retrying is a lesser evil than getting stuck in queued. That was why I added #15929, which will be released in 2.1.3.
The problem is that when the executor reports that a task has failed while the scheduler sees it as queued, then without #15929 the task gets stuck in queued (even in the UI), and at times in up_for_retry (if it has retries), but it is never run again. It is also set to failed in some cases, as @luqic said. But if it's stuck in
So without #15929, the task state would be set in
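To make the trade-off concrete, here is a small standalone sketch (plain Python only; the function and state names are invented for illustration and are not Airflow internals) of the three outcomes being discussed when the executor reports a failure for a task instance the scheduler still considers queued:

# Illustrative sketch only; these names are hypothetical, not Airflow code.
def resolve_mismatch(behaviour, tries_so_far, retries):
    """Executor reported 'failed' while the scheduler's copy of the TI is 'queued'."""
    if behaviour == "pre-15929":
        # The mismatch is only logged; the state never changes, so the task
        # instance stays queued (or up_for_retry) until someone clears it.
        return "queued (stuck)"
    if behaviour == "15929":
        # #15929 force-fails the task instance so it is no longer stuck,
        # but it never checks whether retries are left.
        return "failed"
    if behaviour == "retry-aware":
        # The desired behaviour: go through normal failure handling, which
        # respects the operator's retries setting.
        return "up_for_retry" if tries_so_far < retries else "failed"
    raise ValueError(behaviour)

for mode in ("pre-15929", "15929", "retry-aware"):
    print(mode, "->", resolve_mismatch(mode, tries_so_far=1, retries=3))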
I can reproduce this issue like this. Use this DAG on 2.1.1:

from datetime import timedelta

from kubernetes.client import models as k8s

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    dag_id="pending",
    schedule_interval=None,
    start_date=days_ago(2),
) as dag:
    BashOperator(
        task_id="forever_pending",
        bash_command="date; sleep 30; date",
        retries=3,
        retry_delay=timedelta(seconds=30),
        executor_config={
            # Mount a volume backed by a PVC that does not exist, so the
            # worker pod can never start.
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",
                            volume_mounts=[
                                k8s.V1VolumeMount(mount_path="/foo/", name="vol")
                            ],
                        )
                    ],
                    volumes=[
                        k8s.V1Volume(
                            name="vol",
                            persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
                                claim_name="missing"
                            ),
                        )
                    ],
                )
            ),
        },
    )

And here is the scheduler log from around the failure:
…17819) When a task fails to start, the executor fails it, and its state in the scheduler is queued while its state in the executor is failed. Currently we fail this task without retries to avoid getting stuck. This PR changes this to only fail the task if the callback cannot be executed. This ensures the task does not get stuck.
Closes: #16625
Co-authored-by: Kaxil Naik <[email protected]>
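As a rough illustration of the control flow described in that change (a standalone sketch; the function names and the dict standing in for a task instance are invented for this example and are not Airflow's internal API), the idea is to prefer the normal failure-handling path, which respects retries, and only force-fail the task instance when that path cannot be taken:

# Hypothetical sketch of the behaviour described above; not Airflow source code.
def on_executor_reported_failure(ti, try_send_failure_callback):
    """Handle 'executor says failed, scheduler still sees queued'."""
    if try_send_failure_callback(ti):
        # Normal failure handling decides between up_for_retry and failed
        # based on the task's remaining retries.
        return "handled via failure callback"
    # Only when the callback cannot be executed is the task failed outright,
    # so that it at least does not stay stuck in the queued state.
    ti["state"] = "failed"
    return "force-failed"

# Tiny usage example with a dict standing in for a task instance:
ti = {"state": "queued"}
print(on_executor_reported_failure(ti, lambda _ti: False))  # -> force-failed
print(ti["state"])  # -> failed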
Apache Airflow version: 2.0.2
Kubernetes version:
What happened:
After the worker pod for the task failed to start, the task was marked as failed with the error message:
Executor reports task instance <TaskInstance: datalake_db_cdc_data_integrity.check_integrity_core_prod_my_industries 2021-06-14 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
The task should have been reattempted, as it still has retries left.
What you expected to happen:
The task status should have been set as up_for_retry instead of failing immediately.
Anything else we need to know:
This error has occurred 6 times over the past 2 months, on seemingly random tasks in different DAGs. We run 60 DAGs with 50-100 tasks each every 30 minutes. The affected tasks are a mix of PythonOperator and SparkSubmitOperator. The first time we saw it was in mid-April, when we were on Airflow version 2.0.1. We upgraded to Airflow version 2.0.2 in early May, and the error has occurred 3 more times since then.
Also, the worker pod failing to start is a common error that we frequently encounter, but in most cases these tasks are correctly marked as up_for_retry and reattempted.
This is currently not a big issue for us since it's so rare, but because the tasks do not retry, we have to manually clear the failed tasks to get them to rerun. They have all succeeded on the first try after clearing.
Also, I'm not sure whether this issue is related to #10790 or #16285, so I just created a new one. It's not quite the same as #10790, because the affected tasks are not ExternalTaskSensors, nor as #16285, because the offending lines pointed out there are not in 2.0.2.
Thanks!