
Task set to failed by scheduler directly, while all upstreams succeeded #17370

Closed
cctvzd7 opened this issue Aug 2, 2021 · 16 comments
Labels: affected_version:2.1 (Issues Reported for 2.1), area:Scheduler (including HA scheduler), kind:bug (This is clearly a bug), Stale Bug Report

Comments


cctvzd7 commented Aug 2, 2021

Apache Airflow version: 2.1.1

Kubernetes version (if you are using kubernetes) (use kubectl version):

Environment:

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a): Linux e08g09383.cloud.eu13 3.10.0-327.ali2010.rc7.alios7.x86_64 #1 SMP Thu Jun 29 21:45:21 CST 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: docker-compose
  • Others:
    I installed Airflow via docker-compose; my docker-compose.yaml is attached as docker-compose.txt.

What happened:

A task was set to failed directly by the scheduler, while all of its upstream tasks had succeeded.
The task was never run, and there is no log for it either.

What you expected to happen:

The task should have run.

How to reproduce it:
It happens by chance; I cannot reproduce it reliably.

Anything else we need to know:
A log line from the scheduler:

[2021-08-01 21:46:14,003] {scheduler_job.py:600} INFO - Executed failure callback for <TaskInstance: blink_create_cluster.create_master_ecs_task 2021-08-01 13:46:09.195000+00:00 [failed]> in state failed
cctvzd7 added the kind:bug label Aug 2, 2021

boring-cyborg bot commented Aug 2, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

uranusjr (Member) commented Aug 2, 2021

Could you post some context around the log? That log line simply says the task has failed, which we already know from your description. Something might have happened before that to trigger the failure state.

cctvzd7 (Author) commented Aug 2, 2021

There was only one log line about this issue in the scheduler log file.

uranusjr (Member) commented Aug 2, 2021

That’s unfortunate. Maybe someone would be able to spot some pattern.

@ephraimbuddy (Contributor)

@cctvzd7 Can you add a DAG that can be used to reproduce this?

@jedcunningham (Member)

Is there anything in the normal scheduler log (the above is the parsing log)?

cctvzd7 (Author) commented Aug 3, 2021

@jedcunningham @uranusjr
I enabled the debug log level in the scheduler, and the logs are as follows:

[2021-08-03 11:41:46,088] {scheduler_job.py:643} INFO - DAG(s) dict_keys(['blink_deploy_zprofile']) retrieved from /opt/airflow/dags/dag/blink_deploy_zprofile.py
[2021-08-03 11:41:46,088] {scheduler_job.py:560} DEBUG - Processing Callback Request: {'full_filepath': '/opt/airflow/dags/dag/blink_deploy_zprofile.py', 'msg': 'Executor reports task instance <TaskInstance: blink_deploy_zprofile.add_security_group_rule_task 2021-08-03 03:41:43.527000+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?', 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7fa1ca377ba8>, 'is_failure_callback': True}
[2021-08-03 11:41:46,164] {scheduler_job.py:600} INFO - Executed failure callback for <TaskInstance: blink_deploy_zprofile.add_security_group_rule_task 2021-08-03 03:41:43.527000+00:00 [failed]> in state failed
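
The second line is the interesting one: the executor reported the task as finished (failed) while the scheduler still saw it as queued, so the scheduler failed the task directly and ran the failure callback, which is why the task never produced a log of its own. As a minimal sketch of that reconciliation step, in simplified Python (this is not Airflow's actual scheduler_job.py code; the names below are invented for illustration):

```python
# Simplified illustration of the state mismatch in the log above. This is NOT
# Airflow's actual scheduler code; TaskInstance and handle_executor_event here
# are invented for the example.
from dataclasses import dataclass


@dataclass
class TaskInstance:
    task_id: str
    state: str  # e.g. "queued", "running", "success", "failed"


def handle_executor_event(ti: TaskInstance, executor_state: str, run_failure_callback):
    """React to the executor reporting a terminal state for a task instance."""
    if executor_state == "failed" and ti.state == "queued":
        # The executor says the task finished (failed), but from the scheduler's
        # point of view it never left "queued" -- e.g. the worker/pod died before
        # the task process started. The task is failed directly and its failure
        # callback is executed, so no task log is ever written.
        ti.state = "failed"
        run_failure_callback(ti)


ti = TaskInstance("add_security_group_rule_task", state="queued")
handle_executor_event(ti, "failed", lambda t: print(f"failure callback for {t.task_id}"))
print(ti.state)  # -> "failed", although the task itself never ran
```

Running the sketch prints the failure callback message and leaves the task in the failed state without it ever having run, which matches the single "Executed failure callback" line in the scheduler log.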


cctvzd7 (Author) commented Aug 3, 2021

May be related to #16625.

cctvzd7 (Author) commented Aug 5, 2021

@jedcunningham @uranusjr
Any ideas about this issue?

@ephraimbuddy (Contributor)

> May be related to #16625.

If so, then it's resolved in #16301, which will be released in 2.1.3.

@stijndehaes (Contributor)

> If so, then it's resolved in #16301, which will be released in 2.1.3.

@ephraimbuddy I don't think #16301 solves #16625: the code in that PR only runs once the container has been started, but the issue in #16625 happens when a pod fails without ever having started. I added some more info to #16625; maybe that sheds a bit more light on what might be going on.
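
To make that distinction concrete, here is an illustration-only sketch of how one could detect pods that failed before any container ever started, which is the case described in #16625. It uses the official kubernetes Python client; the namespace and label selector are assumptions, and this is not the Airflow executor's actual code:

```python
# Illustration only: detect pods that failed without any container ever starting
# (the #16625 case), as opposed to a normal task failure. Not Airflow code; the
# namespace and label selector below are assumptions for the example.
from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

for event in watch.Watch().stream(
    v1.list_namespaced_pod,
    namespace="airflow",                        # assumed worker namespace
    label_selector="kubernetes_executor=True",  # assumed example selector
):
    pod = event["object"]
    if pod.status.phase != "Failed":
        continue

    statuses = pod.status.container_statuses or []
    # A container that ever ran has a `running` or `terminated` state; if every
    # status is still `waiting` (or there are no statuses at all), the pod
    # failed before the task process started, so the task never wrote a log.
    ever_started = any(s.state.running or s.state.terminated for s in statuses)
    if not ever_started:
        print(f"{pod.metadata.name} failed before start: reason={pod.status.reason}")
```

In that situation the executor only sees a failed pod, while the scheduler still has the task instance in queued, which is exactly the mismatch shown in the callback request log above.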

@ephraimbuddy (Contributor)

> @ephraimbuddy I don't think #16301 solves #16625: the code in that PR only runs once the container has been started, but the issue in #16625 happens when a pod fails without ever having started. I added some more info to #16625; maybe that sheds a bit more light on what might be going on.

Oh, I see. We also had an issue with tasks being stuck in queued because a pod had an error starting (or something else happened) and the executor reported the task as failed while the scheduler still saw it as queued. We made the change in #15929, which has not been released yet, to resolve the task getting stuck.
I will take a closer look and see if we can make this work properly.

@stijndehaes (Contributor)

> Oh, I see. We also had an issue with tasks being stuck in queued because a pod had an error starting (or something else happened) and the executor reported the task as failed while the scheduler still saw it as queued. We made the change in #15929, which has not been released yet, to resolve the task getting stuck.
> I will take a closer look and see if we can make this work properly.

Ah, I thought this code was already active in our environment, but it isn't 😅. If you could look into making this more robust, that would be great. If I can provide more info or help, please let me know.

eladkal added the area:Scheduler and affected_version:2.1 labels Sep 21, 2021
@shubhamg931

This is a duplicate of #18401, right?

@github-actions

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Please recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.

@github-actions

This issue has been closed because it has not received a response from the issue author.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 29, 2023