Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-3607] Only query DB once per DAG run for TriggerRuleDep #11010

Closed
wants to merge 2 commits into from

Conversation

yuqian90
Copy link
Contributor

This is cherry-picked from #4751 for v1-10-test.

Some investigation into the issue reported in #10790 led to the discovery that this loop in scheduler_job.py takes almost 90% of the time in SchedulerJob.process_file() for large DAGs (around 500 tasks). This causes the DagFileProcessor spawned by the scheduler to go slowly. The reason this loop is slow is that it creates a new DepContext for every ti. And every DepContext needs to populate its own finished_tasks even though this list is the same for every DagRun.

            for ti in tis:
                task = dag.get_task(ti.task_id)

                # fixme: ti.task is transient but needs to be set
                ti.task = task

                if ti.are_dependencies_met(
                        dep_context=DepContext(flag_upstream_failed=True),
                        session=session):
                    self.log.debug('Queuing task: %s', ti)
                    task_instances_list.append(ti.key)

This is the flamegraph generated by py-spy showing the performance of DagFileProcessor in Airflow 1.10.12 before this PR:
https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_before.svg

This is the performance after 1.10.12 is patched with this PR:
https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_after.svg

The nice thing is that #4751 already addressed this issue for master branch. We just need to cherry-pick it to fix this in 1.10.* with some very minor conflict fixes.

While this PR will not fix every scenario that causes #10790, it does reduce the DagFileProcessor time from around 100s to just about 12s for our use case (a DAG with about 500 tasks, many of them are sensors in reschedule mode with poke_interval 60s.).

Original commit message in #4751:

This decreases scheduler delay between tasks by about 20% for larger DAGs,
sometimes more for larger or more complex DAGs.

The delay between tasks can be a major issue, especially when we have dags with
many subdags, figures out that the scheduling process spends plenty of time in
dependency checking, we took the trigger rule dependency which calls the db for
each task instance, we made it call the db just once for each dag_run

(cherry picked from commit 50efda5)

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Sep 18, 2020
@yuqian90
Copy link
Contributor Author

@amichai07 @kaxil @ashb anyone interested to take a look? I'm porting #4751 to 1.10.* to fix a bad DagFileProcessor slowness.

@turbaszek
Copy link
Member

turbaszek commented Sep 22, 2020

I'm not sure if we want to cherry-pick this fix to 1.10 as 2.0 is closer. On the other hand... the job of cherrypicking was done and we just need to merge. What others think @kaxil @ashb @potiuk @mik-laj ?

@mik-laj
Copy link
Member

mik-laj commented Sep 22, 2020

@turbaszek I am not a release manager. Unfortunately, I cannot help you. I am focusing on the development of Airflow 2.0 If I can help you with anything else, please let me know.

@tooptoop4
Copy link
Contributor

pls merge

@yuqian90 yuqian90 force-pushed the v1-10-test_with_4751 branch from da1be37 to 8e1647c Compare September 23, 2020 01:21
@yuqian90
Copy link
Contributor Author

I'm not sure if we want to cherry-pick this fix to 1.10 as 2.0 is closer. On the other hand... the job of cherrypicking was done and we just need to merge.

Thanks @turbaszek . I cherry-picked this because the scheduler in 1.10.* is having trouble for large DAGs (not that large, just hundreds of tasks in one DAG). It queries the db too many times and was struggling to finish. (See flamegraph.before and flamegraph_after in the PR description.) Which caused us to hit #10790 too. So this cherry-pick is more of a fix rather than an improvement in some sense.

@pingzh
Copy link
Contributor

pingzh commented Oct 1, 2020

@yuqian90 you will need this as well: #7503

@potiuk potiuk force-pushed the v1-10-test branch 2 times, most recently from bd4214b to 195f00d Compare October 3, 2020 22:42
@yuqian90 yuqian90 force-pushed the v1-10-test_with_4751 branch from 8e1647c to d52ab8e Compare October 6, 2020 06:40
@yuqian90
Copy link
Contributor Author

yuqian90 commented Oct 6, 2020

@yuqian90 you will need this as well: #7503

Thanks for pointing out. I'm not using the features fixed in #7503 myself so it didn't affect me. However, you are right it should be included since it's a bug fix. So I cherry-picked it too.

@turbaszek
Copy link
Member

As mentioned in #11119 (comment) we are no more accepting new features to 1-10 branch :< I'm sorry. The 2.0 alpha should be released soon 🎉

@turbaszek turbaszek closed this Oct 6, 2020
@yuqian90
Copy link
Contributor Author

yuqian90 commented Oct 7, 2020

As mentioned in #11119 (comment) we are no more accepting new features to 1-10 branch :< I'm sorry. The 2.0 alpha should be released soon 🎉

Hi, @turbaszek and @dimberman I agree that #11119 can be closed since it's a brand new feature. However this one #11010 is not a new feature. It's more of a fix to 1.10.. When DAGs are just a little bit larger than usual (500 tasks) and when user tries to run dozens of DagRuns in at the same time (around 30), airflow-scheduler becomes super slow. This PR is the fix for that backported for 1.10..

@mik-laj
Copy link
Member

mik-laj commented Oct 26, 2020

@turbaszek I think this change should be in the next rellease. See: #11780 WDYT?

@potiuk @kaxil WDYT?

@kaxil
Copy link
Member

kaxil commented Oct 26, 2020

@yuqian90 Can you rebase it once more, please ? Appreciate it

…che#4751)

This decreases scheduler delay between tasks by about 20% for larger DAGs,
sometimes more for larger or more complex DAGs.

The delay between tasks can be a major issue, especially when we have dags with
many subdags, figures out that the scheduling process spends plenty of time in
dependency checking, we took the trigger rule dependency which calls the db for
each task instance, we made it call the db just once for each dag_run

(cherry picked from commit 50efda5)
@yuqian90 yuqian90 force-pushed the v1-10-test_with_4751 branch from d52ab8e to 435c225 Compare October 26, 2020 13:18
@yuqian90
Copy link
Contributor Author

@yuqian90 Can you rebase it once more, please ? Appreciate it

Done. Thank you!

@kaxil
Copy link
Member

kaxil commented Oct 26, 2020

Both commits have been cherry-picked to v1-10-test:

cb750c1
edae056

Thanks @yuqian90 -- Wanted to keep them as separate commits, hence not squashing and merging this PR, instead cherry-picked

@kaxil kaxil closed this Oct 26, 2020
@yuqian90 yuqian90 deleted the v1-10-test_with_4751 branch October 28, 2020 11:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants