Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix schedule_downstream_tasks bug #42582

Merged
merged 3 commits into from
Oct 23, 2024

Conversation

luoyuliuyin
Copy link
Contributor

@luoyuliuyin luoyuliuyin commented Sep 30, 2024

closes: #42581

Problem Description

image
The trigger_rule of task_one_success is one_success. When the upstream node of task_one_success has not yet run, task_one_success is skipped. According to the semantics of one_success, task_one_success should be able to run.

In this scenario, Airflow turns on the schedule_after_task_execution parameter, which means that after the upstream node finishes running, it will try to schedule the downstream node in the current worker.

This problem may occur when task_1 runs faster than task_run. More specifically, it occurs when task_1 finishes running and successfully schedules downstream tasks in the current worker.

Related Code

Below is the code in question
image
image
When task_1 is finished, it will try to schedule downstream tasks. First, a partial dag will be generated.

partial_dag = task.dag.partial_subset(
                task.downstream_task_ids,
                include_downstream=True,
                include_upstream=False,
                include_direct_upstream=True,
            )

task => "task_1"
task.downstream_task_ids => "task_2"

include_downstream=True => ["task_2"]
include_upstream=False => ["task_2"]
include_direct_upstream=True => ["task_2", "task_skip", "task_one_success", "task_1"]

So the final partial_dag is ["task_2", "task_skip", "task_one_success", "task_1"]
image
image

This partial_dag is incomplete because task_one_success's other upstream node task_run is not in it.Specifically, the include_upstream parameter should not be false

Solution

The correct subgraph division should be as follows, include_upstream=True:

partial_dag = task.dag.partial_subset(
                task.downstream_task_ids,
                include_downstream=True,
                include_upstream=True,
                include_direct_upstream=True,
            )

task => "task_1"
task.downstream_task_ids => "task_2"

include_downstream=True => ["task_2"]
include_upstream=True =>["task_2", "task_skip", "task_one_success", "task_1", "task_run", "branch"]
include_direct_upstream=True => ["task_2", "task_skip", "task_one_success", "task_1", "task_run", "branch"]

So the final partial_dag is ["task_2", "task_skip", "task_one_success", "task_1", "task_run", "branch"]

The final partial_dag should be as follows:
image
image

Subgraph pruning will only be performed when the schedule_after_task_execution parameter is turned on. Normal scheduler scheduling will not have this problem.

@romsharon98
Copy link
Contributor

Can you add test that prevent regression?

@luoyuliuyin luoyuliuyin force-pushed the fix_schedule_downstream_tasks_bug branch 2 times, most recently from f2f8ab2 to a4328a7 Compare September 30, 2024 16:57
@luoyuliuyin
Copy link
Contributor Author

luoyuliuyin commented Sep 30, 2024

Can you add test that prevent regression?

Thank you for the suggestion. I've added test to prevent regression. Please check the latest commit.

@jscheffl jscheffl added this to the Airflow 2.10.3 milestone Sep 30, 2024
@jscheffl jscheffl added area:Scheduler including HA (high availability) scheduler type:bug-fix Changelog: Bug Fixes area:core labels Sep 30, 2024
@luoyuliuyin luoyuliuyin force-pushed the fix_schedule_downstream_tasks_bug branch 3 times, most recently from e095405 to fd928db Compare October 9, 2024 12:48
@luoyuliuyin
Copy link
Contributor Author

Adding include_upstream=True is still unsafe. Here is a bad case where task_one_success is ignored again. Given that partial_subset has little impact on performance, I recommend removing partial_subset completely.

image
task => "task_0"
task.downstream_task_ids => "task_1"

include_downstream=True => ["task_1", "task_2"]
include_upstream=True =>["task_0", "task_1", "task_2"]
include_direct_upstream=True => ["task_0", "task_1", "task_2", "task_one_success", "task_skip"]

So the final partial_dag is ["task_0", "task_1", "task_2", "task_one_success", "task_skip"]

@luoyuliuyin luoyuliuyin force-pushed the fix_schedule_downstream_tasks_bug branch 2 times, most recently from 1708d5f to a45a61b Compare October 10, 2024 12:25
@shahar1 shahar1 self-requested a review October 11, 2024 06:17
@shahar1
Copy link
Contributor

shahar1 commented Oct 11, 2024

Also, DB tests currently fail

@luoyuliuyin luoyuliuyin force-pushed the fix_schedule_downstream_tasks_bug branch from a45a61b to eb5833b Compare October 11, 2024 12:01
Copy link
Contributor

@shahar1 shahar1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I'm ok with merging it after resolving my last nitpick.
@potiuk / @uranusjr / @ephraimbuddy - any objections?

airflow/models/taskinstance.py Outdated Show resolved Hide resolved
@potiuk
Copy link
Member

potiuk commented Oct 14, 2024

I have completely no problem. The use of partial_subset caused problems with serialization already for a number of users so removing it seems like a good idaa, and performance-wise it should not be problematic.

@ashb @uranusjr ? Do You have any objections?

@potiuk potiuk merged commit 3fceaa6 into apache:main Oct 23, 2024
52 checks passed
potiuk pushed a commit to potiuk/airflow that referenced this pull request Oct 23, 2024
* fix schedule_downstream_tasks bug

* remove partial_subset

* Update comment

---------

Co-authored-by: 维湘 <[email protected]>
(cherry picked from commit 3fceaa6)
potiuk added a commit that referenced this pull request Oct 23, 2024
* fix schedule_downstream_tasks bug

* remove partial_subset

* Update comment

---------

Co-authored-by: 维湘 <[email protected]>
(cherry picked from commit 3fceaa6)

Co-authored-by: luoyuliuyin <[email protected]>
harjeevanmaan pushed a commit to harjeevanmaan/airflow that referenced this pull request Oct 23, 2024
* fix schedule_downstream_tasks bug

* remove partial_subset

* Update comment

---------

Co-authored-by: 维湘 <[email protected]>
PaulKobow7536 pushed a commit to PaulKobow7536/airflow that referenced this pull request Oct 24, 2024
* fix schedule_downstream_tasks bug

* remove partial_subset

* Update comment

---------

Co-authored-by: 维湘 <[email protected]>
utkarsharma2 pushed a commit that referenced this pull request Oct 24, 2024
* fix schedule_downstream_tasks bug

* remove partial_subset

* Update comment

---------

Co-authored-by: 维湘 <[email protected]>
(cherry picked from commit 3fceaa6)

Co-authored-by: luoyuliuyin <[email protected]>
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
* fix schedule_downstream_tasks bug

* remove partial_subset

* Update comment

---------

Co-authored-by: 维湘 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core area:Scheduler including HA (high availability) scheduler type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

one_success trigger_rule scheduling exception
5 participants