
Task set to failed by scheduler directly, while all upstreams succeeded #17370

Closed
cctvzd7 opened this issue Aug 2, 2021 · 16 comments
Labels: affected_version:2.1 (Issues Reported for 2.1), area:Scheduler (including HA scheduler), kind:bug (This is clearly a bug), Stale Bug Report

Comments


cctvzd7 commented Aug 2, 2021

Apache Airflow version: 2.1.1

Kubernetes version (if you are using kubernetes) (use kubectl version):

Environment:

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a): Linux e08g09383.cloud.eu13 3.10.0-327.ali2010.rc7.alios7.x86_64 #1 SMP Thu Jun 29 21:45:21 CST 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: docker-compose
  • Others:
    I installed Airflow via docker-compose; my docker-compose.yaml is attached as docker-compose.txt.

What happened:

A task was set to failed directly by the scheduler, while all of its upstream tasks had succeeded.
The task was never run, and there is no log for it either.

What you expected to happen:

The task should have run.

How to reproduce it:
It happens by chance; I cannot reproduce it reliably.

Anything else we need to know:
A log line from the scheduler:

[2021-08-01 21:46:14,003] {scheduler_job.py:600} INFO - Executed failure callback for <TaskInstance: blink_create_cluster.create_master_ecs_task 2021-08-01 13:46:09.195000+00:00 [failed]> in state failed
cctvzd7 added the kind:bug label Aug 2, 2021

boring-cyborg bot commented Aug 2, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

uranusjr (Member) commented Aug 2, 2021

Could you post some context around the log? That log line simply says the task has failed, which we already know from your description. Something might have happened before that to trigger the failure state.

cctvzd7 (Author) commented Aug 2, 2021

There was only one log line about this issue in the scheduler log file.

uranusjr (Member) commented Aug 2, 2021

That’s unfortunate. Maybe someone would be able to spot some pattern.

@ephraimbuddy (Contributor)

@cctvzd7 Can you add a DAG that can be used to reproduce this?

@jedcunningham (Member)

Is there anything in the normal scheduler log (the above is the parsing log)?

cctvzd7 (Author) commented Aug 3, 2021

@jedcunningham @uranusjr
I enabled the debug log level in the scheduler, and the logs are as follows:

[2021-08-03 11:41:46,088] {scheduler_job.py:643} INFO - DAG(s) dict_keys(['blink_deploy_zprofile']) retrieved from /opt/airflow/dags/dag/blink_deploy_zprofile.py
[2021-08-03 11:41:46,088] {scheduler_job.py:560} DEBUG - Processing Callback Request: {'full_filepath': '/opt/airflow/dags/dag/blink_deploy_zprofile.py', 'msg': 'Executor reports task instance <TaskInstance: blink_deploy_zprofile.add_security_group_rule_task 2021-08-03 03:41:43.527000+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?', 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7fa1ca377ba8>, 'is_failure_callback': True}
[2021-08-03 11:41:46,164] {scheduler_job.py:600} INFO - Executed failure callback for <TaskInstance: blink_deploy_zprofile.add_security_group_rule_task 2021-08-03 03:41:43.527000+00:00 [failed]> in state failed
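
The second line is the interesting one: the executor reported the task as finished (failed) while the scheduler still saw it as queued, so the scheduler failed the task directly and ran the failure callback, which is why the task never produced a log of its own. As a minimal sketch of that reconciliation step, in simplified Python (this is not Airflow's actual scheduler_job.py code; the names below are invented for illustration):

```python
# Simplified illustration of the state mismatch in the log above. This is NOT
# Airflow's actual scheduler code; TaskInstance and handle_executor_event here
# are invented for the example.
from dataclasses import dataclass


@dataclass
class TaskInstance:
    task_id: str
    state: str  # e.g. "queued", "running", "success", "failed"


def handle_executor_event(ti: TaskInstance, executor_state: str, run_failure_callback):
    """React to the executor reporting a terminal state for a task instance."""
    if executor_state == "failed" and ti.state == "queued":
        # The executor says the task finished (failed), but from the scheduler's
        # point of view it never left "queued" -- e.g. the worker/pod died before
        # the task process started. The task is failed directly and its failure
        # callback is executed, so no task log is ever written.
        ti.state = "failed"
        run_failure_callback(ti)


ti = TaskInstance("add_security_group_rule_task", state="queued")
handle_executor_event(ti, "failed", lambda t: print(f"failure callback for {t.task_id}"))
print(ti.state)  # -> "failed", although the task itself never ran
```

Running the sketch prints the failure callback message and leaves the task in the failed state without it ever having run, which matches the single "Executed failure callback" line in the scheduler log.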


cctvzd7 (Author) commented Aug 3, 2021

May be related to #16625.

cctvzd7 (Author) commented Aug 5, 2021

@jedcunningham @uranusjr
Any ideas about this issue?

@ephraimbuddy (Contributor)

> May be related to #16625.

If so, then it's resolved in #16301, which will be released in 2.1.3.

@stijndehaes (Contributor)

> If so, then it's resolved in #16301, which will be released in 2.1.3.

@ephraimbuddy I don't think #16301 solves #16625: the code in that PR only runs once the container has been started, but the issue in #16625 happens when a pod fails without ever having started. I added some more info to #16625; maybe that sheds a bit more light on what might be going on.
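
To make that distinction concrete, here is an illustration-only sketch of how one could detect pods that failed before any container ever started, which is the case described in #16625. It uses the official kubernetes Python client; the namespace and label selector are assumptions, and this is not the Airflow executor's actual code:

```python
# Illustration only: detect pods that failed without any container ever starting
# (the #16625 case), as opposed to a normal task failure. Not Airflow code; the
# namespace and label selector below are assumptions for the example.
from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

for event in watch.Watch().stream(
    v1.list_namespaced_pod,
    namespace="airflow",                        # assumed worker namespace
    label_selector="kubernetes_executor=True",  # assumed example selector
):
    pod = event["object"]
    if pod.status.phase != "Failed":
        continue

    statuses = pod.status.container_statuses or []
    # A container that ever ran has a `running` or `terminated` state; if every
    # status is still `waiting` (or there are no statuses at all), the pod
    # failed before the task process started, so the task never wrote a log.
    ever_started = any(s.state.running or s.state.terminated for s in statuses)
    if not ever_started:
        print(f"{pod.metadata.name} failed before start: reason={pod.status.reason}")
```

In that situation the executor only sees a failed pod, while the scheduler still has the task instance in queued, which is exactly the mismatch shown in the callback request log above.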

@ephraimbuddy (Contributor)

> @ephraimbuddy I don't think #16301 solves #16625: the code in that PR only runs once the container has been started, but the issue in #16625 happens when a pod fails without ever having started. I added some more info to #16625; maybe that sheds a bit more light on what might be going on.

Oh, I see. We also had an issue with tasks being stuck in queued because a pod had an error starting (or something else happened) and the executor reported the task as failed while the scheduler still saw it as queued. We made the change in #15929, which has not been released yet, to resolve the task getting stuck.
I will take a closer look and see if we can make this work properly.

@stijndehaes (Contributor)

> Oh, I see. We also had an issue with tasks being stuck in queued because a pod had an error starting (or something else happened) and the executor reported the task as failed while the scheduler still saw it as queued. We made the change in #15929, which has not been released yet, to resolve the task getting stuck.
> I will take a closer look and see if we can make this work properly.

Ah, I thought this code was already active in our environment, but it isn't 😅. If you could look into making this more robust, that would be great. If I can provide more info or help, please let me know.

eladkal added the area:Scheduler and affected_version:2.1 labels Sep 21, 2021
@shubhamg931

This is a duplicate of #18401, right?

@github-actions

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Please recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.

@github-actions

This issue has been closed because it has not received a response from the issue author.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 29, 2023