Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Issues on Initial Page Load #38776

Closed
2 tasks done
vchiapaikeo opened this issue Apr 5, 2024 · 11 comments
Closed
2 tasks done

Performance Issues on Initial Page Load #38776

vchiapaikeo opened this issue Apr 5, 2024 · 11 comments
Labels
affected_version:2.7 Issues Reported for 2.7 area:core area:performance area:webserver Webserver related Issues kind:bug This is a clearly a bug pending-response stale Stale PRs per the .github/workflows/stale.yml policy file

Comments

@vchiapaikeo
Copy link
Contributor

vchiapaikeo commented Apr 5, 2024

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.7.3

What happened?

We have started to notice more slowness on initial page load. On our deployment, it takes nearly 6s for the server to respond.

image

We started to do some analysis of the queries that are being run and we found 3 that take over 1400ms (and likely are being run serially).

  1. Getting running dags -1.4s
  2. Getting failed dags - 1.8s
  3. Populating the dags and data in the html table itself (which defaults to 25 and we've set to 100 but we don't see much improvement when we lower it - only 500ms at most) - 1.6s
image

All of these queries are attached in the following txt file with explain plans and explain analyzes:

slow_queries_on_initial_load.txt

What you think should happen instead?

We are wondering if there are any ways we could possibly improve the performance of initial page load. Some ideas that come to mind:

  1. Caching with short ttls that are invalidated intelligently
  2. Improved indexing strategies
  3. Concurrent/async query execution
  4. Your idea here!

How to reproduce

Add ~2-4k dags to an Airflow deployment and reload the initial page

Operating System

Debian 11

Versions of Apache Airflow Providers

N/A

Deployment

Official Apache Airflow Helm Chart

Deployment details

We deploy our cluster on GCP's GKE

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@vchiapaikeo vchiapaikeo added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Apr 5, 2024
@RNHTTR RNHTTR added area:webserver Webserver related Issues and removed needs-triage label for new issues that we didn't triage yet labels Apr 5, 2024
@eladkal
Copy link
Contributor

eladkal commented Apr 15, 2024

cc @ephraimbuddy @bbovenzi

@pankajkoti
Copy link
Member

pankajkoti commented Apr 24, 2024

hi @vchiapaikeo , were you able to make progress on this? Would like to learn if you added any additional indexes or so and would be more than happy to collaborate with you on this one.

My guess is we might need few more indexes by looking at the slow queries.

@vchiapaikeo
Copy link
Contributor Author

vchiapaikeo commented Apr 24, 2024

Hey @pankajkoti ! Yes, that would be wonderful! The only thing I can think of from looking at the queries and the explain plan is something around the dag table - since there already is an index on dag_run and the state col and the full table scans are happening there. Based on those explain plans, perhaps something on the dag table around dag_id, is_subdag, and is_active? I see two existing indices on dag right now:

CREATE INDEX idx_root_dag_id
   ON dag (root_dag_id ASC);

CREATE INDEX idx_next_dagrun_create_after
   ON dag (next_dagrun_create_after ASC);

The other thing I was thinking about (but have just been very slow to implement and try) is a ThreadPool for some of the queries in the /home route. The slow queries are here and are retrieved from this call. These get loaded serially here --> https://github.com/apache/airflow/blob/main/airflow/www/views.py#L858-L859

My thought is that these queries plus this one can all get submitted to a threadpool for execution so that they run concurrently. Wanted to get the teams' thoughts on that and if you think it would work. The implementation shouldn't be too difficult there.

@tirkarthi
Copy link
Contributor

Related work on making sql queries async. cc @hussein-awala

#36504

@vchiapaikeo
Copy link
Contributor Author

@hussein-awala , a ThreadPoolExecutor implementation would be easier to get going intially but my guess is that the switch to async calls here makes more sense overall and would be cleaner. Let me know what ya think when you have a few mins!

@sunank200
Copy link
Collaborator

sunank200 commented May 28, 2024

These queries (https://github.com/apache/airflow/blob/main/airflow/www/views.py#L855, https://github.com/apache/airflow/blob/main/airflow/www/views.py#L860, https://github.com/apache/airflow/blob/main/airflow/www/views.py#L866, and so on) are executed sequentially as specified by @vchiapaikeo. I attempted to run these queries locally with 4000 DAGs, and it took 1.86 seconds to display the running DAGs. Running these critical database queries concurrently in the index method using ThreadPoolExecutor may not be the best approach. Using threads means it still depends on blocking I/O, which can lead to inefficiencies, especially under heavy load limited by the Python Global Interpreter Lock.

Using async queries seems like a better approach #36504. SQLAlchemy has supported Asynchronous I/O (asyncio) since 1.4, which we can utilize here. (https://docs.sqlalchemy.org/en/14/orm/extensions/asyncio.html) cc: @tirkarthi @vchiapaikeo

@pankajkoti have you tried adding additional indexes as suggested in the thread? I tried it in my local setup using breeze but it doesn't seem to improve the performance drastically.

@vchiapaikeo
Copy link
Contributor Author

Right, makes sense - it sounds like the best solution then may be to wait for @hussein-awala 's AIP on introducing async SQLAlchemy sessions into the codebase (seems like they will start with connections and variables first) and then calling these db functions asynchronously on the webserver. Wdyt?

@pankajkoti
Copy link
Member

@pankajkoti have you tried adding additional indexes as suggested in the thread? I tried it in my local setup using breeze but it doesn't seem to improve the performance drastically.

@sunank200 yes, I did try adding few indexes based on the SQL queries that get fired but I did not observe any improvement either

@sunank200
Copy link
Collaborator

sunank200 commented May 29, 2024

Right, makes sense - it sounds like the best solution then may be to wait for @hussein-awala 's AIP on introducing async SQLAlchemy sessions into the codebase (seems like they will start with connections and variables first) and then calling these db functions asynchronously on the webserver. Wdyt?

Sure. Waiting for response from @hussein-awala.

Copy link

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jun 13, 2024
Copy link

This issue has been closed because it has not received response from the issue author.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affected_version:2.7 Issues Reported for 2.7 area:core area:performance area:webserver Webserver related Issues kind:bug This is a clearly a bug pending-response stale Stale PRs per the .github/workflows/stale.yml policy file
Projects
None yet
Development

No branches or pull requests

7 participants