DAG deletion is slow due to lack of database indexes on dag_id #20249

Closed
2 tasks done
kushsharma opened this issue Dec 13, 2021 · 4 comments · Fixed by #20282
Labels
affected_version:2.2 Issues Reported for 2.2 area:core kind:bug This is a clearly a bug

Comments

@kushsharma
Contributor

kushsharma commented Dec 13, 2021

Apache Airflow version

2.2.1

What happened

We run an Airflow instance with approximately 6k DAGs.

  • If we delete a DAG from the UI, the UI times out.
  • If we delete a DAG from the CLI, it completes but sometimes takes up to half an hour to finish.

Most of the execution time appears to be spent in database queries. I know I could just throw more CPU and memory at the DB instance and hope it works, but I think we can do better in the delete operation itself. Correct me if I am wrong, but I think this is the code that gets executed when deleting a DAG from the UI or the CLI, via delete_dag.py:

    for model in models.base.Base._decl_class_registry.values():
        if hasattr(model, "dag_id"):
            if keep_records_in_log and model.__name__ == 'Log':
                continue
            cond = or_(model.dag_id == dag_id, model.dag_id.like(dag_id + ".%"))
            count += session.query(model).filter(cond).delete(synchronize_session='fetch')
    if dag.is_subdag:
        parent_dag_id, task_id = dag_id.rsplit(".", 1)
        for model in TaskFail, models.TaskInstance:
            count += (
                session.query(model).filter(model.dag_id == parent_dag_id, model.task_id == task_id).delete()
            )

I see we iterate over all the models and match on dag_id. Some of the tables, such as job, don't have an index on the dag_id column, which makes this operation really slow. Adding those indexes could be an easy fix for this issue.

For example, the following query took 20 minutes to finish on a 16-CPU, 32 GB Postgres instance:

SELECT job.id AS job_id FROM job WHERE job.dag_id = $1 OR job.dag_id LIKE $2

and the EXPLAIN output (shown here for the equality part of the filter) is as follows:

EXPLAIN SELECT job.id AS job_id FROM job WHERE job.dag_id = '';
                                QUERY PLAN
---------------------------------------------------------------------------
 Gather  (cost=1000.00..1799110.10 rows=6351 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on job  (cost=0.00..1797475.00 rows=2646 width=8)
         Filter: ((dag_id)::text = ''::text)
(4 rows)

This is just one of the many queries that are being executed during the delete operation.
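As a rough illustration of why the missing index matters, here is a minimal sketch that compares the query plan before and after adding an index on dag_id. It uses Python's built-in sqlite3 module as a stand-in, so the plan text differs from the Postgres output above; the toy job table, its rows, and the index name idx_job_dag_id are assumptions made for the example, not Airflow's actual schema or migration.

```python
import sqlite3

# Toy stand-in for the job table; the real Airflow table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, dag_id TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO job (dag_id, state) VALUES (?, ?)",
    [(f"dag_{i % 500}", "success") for i in range(5000)],
)

QUERY = "SELECT job.id FROM job WHERE job.dag_id = ?"


def plan(connection):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("dag_42",)).fetchall()
    return " | ".join(row[-1] for row in rows)


# Without an index, SQLite has to scan the whole table.
plan_before = plan(conn)

# Hypothetical index name; with it, the lookup becomes an index search.
conn.execute("CREATE INDEX idx_job_dag_id ON job (dag_id)")
plan_after = plan(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan depends on the SQLite version, but the first plan should report a full scan of job and the second a search using idx_job_dag_id, which mirrors the Seq Scan shown in the Postgres plan above.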

What you expected to happen

Deleting a DAG should not take this long.

How to reproduce

No response

Operating System

nix

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@kushsharma kushsharma added area:core kind:bug This is a clearly a bug labels Dec 13, 2021
@boring-cyborg

boring-cyborg bot commented Dec 13, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

@uranusjr
Member

Let’s add indexes to those tables then. Would you be interested in looking into those various tables and creating one for each of them?

@uranusjr uranusjr changed the title Slow deletion of a DAG DAG deletion is slow due to lack of database indexes on dag_id Dec 13, 2021
@sanjayhallan

A bit of analysis for this ticket: these are the models that do and don't have indexes on dag_id:

...     if hasattr(model, "dag_id"):
...             print(model)
... 
<class 'airflow.models.log.Log'> yes btree index
<class 'airflow.models.taskfail.TaskFail'> no 
<class 'airflow.models.taskreschedule.TaskReschedule'> no
<class 'airflow.models.xcom.BaseXCom'> no
<class 'airflow.models.taskinstance.TaskInstance'> no
<class 'airflow.models.dagrun.DagRun'> no
<class 'airflow.models.dag.DagTag'> no
<class 'airflow.models.dag.DagModel'> yes PK index
<class 'airflow.models.renderedtifields.RenderedTaskInstanceFields'> no
<class 'airflow.models.sensorinstance.SensorInstance'> no
<class 'airflow.models.slamiss.SlaMiss'> yes btree index
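A similar check can be run against the database itself rather than the models. The sketch below is only an illustration: it uses Python's built-in sqlite3 module with a made-up four-table schema (not Airflow's real one), and the helper name tables_missing_dag_id_index is hypothetical. It reports tables that have a dag_id column but no index whose first column is dag_id.

```python
import sqlite3


def tables_missing_dag_id_index(conn):
    """Return tables that have a dag_id column but no index led by it."""
    missing = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )]
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        if "dag_id" not in columns:
            continue
        leading_columns = set()
        # PRAGMA index_list rows start with (seq, name, unique, ...)
        for index_row in conn.execute(f'PRAGMA index_list("{table}")').fetchall():
            # PRAGMA index_info rows: (seqno, cid, name), one per indexed column
            info = conn.execute(f'PRAGMA index_info("{index_row[1]}")').fetchall()
            if info:
                leading_columns.add(min(info, key=lambda r: r[0])[2])
        if "dag_id" not in leading_columns:
            missing.append(table)
    return missing


# Hypothetical toy schema, loosely named after some Airflow tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dag (dag_id TEXT PRIMARY KEY);
    CREATE TABLE job (id INTEGER PRIMARY KEY, dag_id TEXT, state TEXT);
    CREATE TABLE log (id INTEGER PRIMARY KEY, dag_id TEXT, event TEXT);
    CREATE INDEX idx_log_dag ON log (dag_id);
    CREATE TABLE task_reschedule (
        id INTEGER PRIMARY KEY, dag_id TEXT, task_id TEXT, execution_date TEXT
    );
    CREATE INDEX idx_task_reschedule_dag_task_date
        ON task_reschedule (dag_id, task_id, execution_date);
""")

missing = tables_missing_dag_id_index(conn)
print(missing)
```

In this toy schema only job is reported: dag is covered by its primary key, log by a single-column index, and task_reschedule by a composite index that starts with dag_id. On Postgres the same idea would need the catalog views instead of SQLite PRAGMAs.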

@kushsharma
Contributor Author

Some of the tables already have composite indexes that start with dag_id, so I don't think they need an explicit index on dag_id. For example:

  • task_reschedule: "idx_task_reschedule_dag_task_date" btree (dag_id, task_id, execution_date)
  • sensor_instance: "ti_primary_key" UNIQUE, btree (dag_id, task_id, execution_date)
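The point about composite indexes can be checked with a small experiment. This is a sketch under assumptions: it uses Python's built-in sqlite3 module rather than Postgres, and the table definition is a simplified stand-in for task_reschedule. It shows that a filter on the leading column (dag_id) can search the composite index, while a filter on a non-leading column (task_id) cannot.

```python
import sqlite3

# Simplified stand-in for task_reschedule with the composite index from above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_reschedule ("
    "id INTEGER PRIMARY KEY, dag_id TEXT, task_id TEXT, execution_date TEXT)"
)
conn.execute(
    "CREATE INDEX idx_task_reschedule_dag_task_date "
    "ON task_reschedule (dag_id, task_id, execution_date)"
)


def plan_detail(sql, params):
    """Return the detail text of EXPLAIN QUERY PLAN for one statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# dag_id is the leading column, so the index can be searched directly.
by_dag = plan_detail(
    "SELECT id FROM task_reschedule WHERE dag_id = ?", ("example_dag",)
)

# task_id is not the leading column, so every entry still has to be scanned.
by_task = plan_detail(
    "SELECT id FROM task_reschedule WHERE task_id = ?", ("example_task",)
)

print("by dag_id: ", by_dag)
print("by task_id:", by_task)
```

The same leading-column rule applies to Postgres btree indexes in general, although its planner may still pick a different plan depending on table statistics.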
