Improve performance by 22-35% or more by caching partial parse artefact (#904)

Improve the performance of running the benchmark DAG with 100 tasks by 34% and the benchmark DAG with 10 tasks by 22%, by persisting the dbt partial parse artifact in Airflow nodes. The gain can be even higher for dbt projects that take longer to parse.

With the introduction of #800, Cosmos supports using dbt partial parsing files. This feature has led to a substantial performance improvement, particularly for large dbt projects, both during Airflow DAG parsing (using `LoadMode.DBT_LS`) and during Airflow task execution (when using `ExecutionMode.LOCAL` and `ExecutionMode.VIRTUALENV`).

There were two limitations in the initial support for partial parsing, which this PR aims to address:

1. DAGs using Cosmos `ProfileMapping` classes could not leverage this feature. Partial parsing relies on profile files not changing, and by default Cosmos would mock the dbt profile in several parts of the code. As a consequence, users trying Cosmos 1.4.0a1 would see the following messages:

```
13:33:16 Unable to do partial parsing because profile has changed
13:33:16 Unable to do partial parsing because env vars used in profiles.yml have changed
```

2. The user had to explicitly provide a `partial_parse.msgpack` file in the original project folder for their Airflow deployment - and if, for any reason, this file became outdated, the user would not benefit from partial parsing. Since Cosmos runs dbt tasks from within a temporary directory, the partial parse file would become stale for some users: it would be updated in the temporary directory, but the next time the task ran, Cosmos/dbt would not pick up the recently updated `partial_parse.msgpack` file.

This PR addresses these two issues, respectively, by:

1. Allowing users who want to combine Cosmos `ProfileMapping` with partial parsing to use `RenderConfig(enable_mock_profile=False)`.
2. Introducing a Cosmos cache directory where partial parsing files are persisted. This feature is enabled by default, but users can opt out by setting the Airflow configuration `[cosmos][enable_cache] = False` (or exporting the environment variable `AIRFLOW__COSMOS__ENABLE_CACHE=0`). Users can also define the directory used to store these files via the `[cosmos][cache_dir]` Airflow configuration. By default, Cosmos creates and uses a folder `cosmos` inside the system's temporary directory: https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir

This PR affects both DAG parsing and task execution. Although it does not introduce an optimisation per se, it makes the partial parse feature implemented in #800 available to more users.

Closes: #722

The documentation was updated in PR #898.

Some future optimisation steps related to caching, to be addressed in separate PRs:

i. Change how mocked profiles are created, so the file itself is created in the same way, referencing an environment variable with the same name - and only the value of the environment variable changes (#924).
ii. Extend caching to the `profiles.yml` created by Cosmos in the newly introduced `tmp/cosmos` directory, so it does not need to be recreated every time (#925).
iii. Extend caching to the Airflow DAG/task group as a pickle file - this approach is more generic and would work for every type of DAG parsing and executor (#926).
iv. Support persisting/fetching the cache from remote storage, so it does not have to be replicated on every Airflow scheduler and worker node (#927).
v. Cache the dbt deps lock file / avoid installing dbt packages every time. We can leverage the `package-lock.yml` introduced in dbt 1.7 (https://docs.getdbt.com/reference/commands/deps#predictable-package-installs), but ideally we'd have a strategy that also supports older versions of dbt (#930).
vi. Support caching `partial_parse.msgpack` even when vars change: https://medium.com/@sebastian.daum89/how-to-speed-up-single-dbt-invocations-when-using-changing-dbt-variables-b9d91ce3fb0d
vii. Support partial parsing in the Docker and Kubernetes Cosmos executors (#929).
viii. Centralise all the Airflow-based config into Cosmos `settings.py` and create a dedicated docs page containing information about these settings (#928).

**How to validate this change**

Run the performance benchmark against this branch and `main`, checking the value of `/tmp/performance_results.txt`. Example of commands run locally:

```
# Setup
AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:[email protected]:5432/postgres" PYTHONPATH=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance-setup

# Run test for 100 dbt models per DAG:
MODEL_COUNT=100 AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:[email protected]:5432/postgres" PYTHONPATH=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance
```

An example of the output when running 100 models with the `main` branch:

```
NUM_MODELS=100
TIME=114.18614888191223
MODELS_PER_SECOND=0.8757629623135543
DBT_VERSION=1.7.13
```

And with the current PR:

```
NUM_MODELS=100
TIME=75.17766404151917
MODELS_PER_SECOND=1.33018232576064
DBT_VERSION=1.7.13
```
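For reference, opting out of the new cache, or relocating it, can be sketched in shell form. This assumes Airflow's standard mapping of `[section] option` configuration to `AIRFLOW__SECTION__OPTION` environment variables; the `/opt/airflow/cosmos_cache` path is purely illustrative:

```shell
# Disable Cosmos caching entirely (equivalent to [cosmos] enable_cache = False
# in airflow.cfg), as described in the PR description above.
export AIRFLOW__COSMOS__ENABLE_CACHE=0

# Or keep caching enabled but store artifacts in a custom directory instead of
# the default <system tmp>/cosmos location (env-var form of [cosmos] cache_dir).
export AIRFLOW__COSMOS__CACHE_DIR=/opt/airflow/cosmos_cache
```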
Showing 18 changed files with 426 additions and 70 deletions.
The new caching module introduced by this commit:

```python
from __future__ import annotations

import shutil
from pathlib import Path

from airflow.models.dag import DAG
from airflow.utils.task_group import TaskGroup

from cosmos import settings
from cosmos.constants import DBT_MANIFEST_FILE_NAME, DBT_TARGET_DIR_NAME
from cosmos.dbt.project import get_partial_parse_path


# It was considered to create a cache identifier based on the dbt project path, as opposed
# to where it is used in Airflow. However, we could have concurrency issues if the same
# dbt cached directory was being used by different dbt task groups or DAGs within the same
# node. For this reason, as a starting point, the cache is identified by where it is used.
# This can be reviewed in the future.
def _create_cache_identifier(dag: DAG, task_group: TaskGroup | None) -> str:
    """
    Given a DAG and an (optional) task group, create the identifier for caching.

    :param dag: The Cosmos DbtDag being cached
    :param task_group: (optional) The Cosmos DbtTaskGroup being cached
    :return: Unique identifier representing the cache
    """
    if task_group:
        cache_identifiers_list = []
        if task_group.dag_id is not None:
            cache_identifiers_list = [task_group.dag_id]
        if task_group.group_id is not None:
            cache_identifiers_list.extend([task_group.group_id.replace(".", "__")])
        cache_identifier = "__".join(cache_identifiers_list)
    else:
        cache_identifier = dag.dag_id

    return cache_identifier
```
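The naming scheme above can be mirrored in a standalone sketch, using plain strings as stand-ins for Airflow's `DAG`/`TaskGroup` objects so it runs without an Airflow installation (the `build_cache_identifier` helper is hypothetical, not part of the commit):

```python
from __future__ import annotations


def build_cache_identifier(dag_id: str, group_id: str | None = None) -> str:
    # Mirror the logic above: join the DAG id and the task group id
    # (with "." replaced by "__"), using "__" as the separator.
    parts = [dag_id]
    if group_id is not None:
        parts.append(group_id.replace(".", "__"))
    return "__".join(parts)


print(build_cache_identifier("example_dag"))                 # example_dag
print(build_cache_identifier("example_dag", "outer.inner"))  # example_dag__outer__inner
```

Identifying the cache by where it is used (DAG id plus group id), rather than by the dbt project path, is what avoids the concurrency issue the module comment describes: two DAGs sharing one dbt project get separate cache directories.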
```python
def _obtain_cache_dir_path(cache_identifier: str, base_dir: Path = settings.cache_dir) -> Path:
    """
    Return a directory used to cache a specific Cosmos DbtDag or DbtTaskGroup. If the directory
    does not exist, create it.

    :param cache_identifier: Unique key used as a cache identifier
    :param base_dir: Root directory where cache will be stored
    :return: Path to directory used to cache this specific Cosmos DbtDag or DbtTaskGroup
    """
    cache_dir_path = base_dir / cache_identifier
    tmp_target_dir = cache_dir_path / DBT_TARGET_DIR_NAME
    tmp_target_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir_path


def _get_timestamp(path: Path) -> float:
    """
    Return the timestamp of a path or 0, if it does not exist.

    :param path: Path to the file or directory of interest
    :return: File or directory timestamp
    """
    try:
        timestamp = path.stat().st_mtime
    except FileNotFoundError:
        timestamp = 0
    return timestamp
```
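The timestamp helper degrades to `0` for missing paths, which lets callers compare modification times without separate existence checks. A quick illustration of that contract (`mtime_or_zero` is a hypothetical stand-in with the same behaviour):

```python
import tempfile
from pathlib import Path


def mtime_or_zero(path: Path) -> float:
    # Same contract as _get_timestamp above: st_mtime if the path exists, else 0.
    try:
        return path.stat().st_mtime
    except FileNotFoundError:
        return 0


tmp = Path(tempfile.mkdtemp())
existing = tmp / "partial_parse.msgpack"
existing.write_bytes(b"")
missing = tmp / "does_not_exist.msgpack"

print(mtime_or_zero(existing) > 0)  # True
print(mtime_or_zero(missing))       # 0
```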
```python
def _get_latest_partial_parse(dbt_project_path: Path, cache_dir: Path) -> Path | None:
    """
    Return the path to the latest partial parse file, if one exists.

    :param dbt_project_path: Original dbt project path
    :param cache_dir: Path to the Cosmos project cache directory
    :return: Either the Path to the latest partial parse file, or None.
    """
    project_partial_parse_path = get_partial_parse_path(dbt_project_path)
    cosmos_cached_partial_parse_filepath = get_partial_parse_path(cache_dir)

    age_project_partial_parse = _get_timestamp(project_partial_parse_path)
    age_cosmos_cached_partial_parse_filepath = _get_timestamp(cosmos_cached_partial_parse_filepath)

    if age_project_partial_parse and age_cosmos_cached_partial_parse_filepath:
        if age_project_partial_parse > age_cosmos_cached_partial_parse_filepath:
            return project_partial_parse_path
        else:
            return cosmos_cached_partial_parse_filepath
    elif age_project_partial_parse:
        return project_partial_parse_path
    elif age_cosmos_cached_partial_parse_filepath:
        return cosmos_cached_partial_parse_filepath

    return None
```
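The selection rule above is "newest wins, missing never wins". A self-contained sketch of the same comparison with plain paths (the `target/partial_parse.msgpack` layout follows dbt's conventions; `pick_latest` itself is hypothetical):

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


def pick_latest(path_a: Path, path_b: Path) -> Path | None:
    # Newest file wins; a missing file (timestamp 0) never wins; None if both missing.
    ts_a = path_a.stat().st_mtime if path_a.exists() else 0
    ts_b = path_b.stat().st_mtime if path_b.exists() else 0
    if not ts_a and not ts_b:
        return None
    if ts_a and ts_b:
        return path_a if ts_a > ts_b else path_b
    return path_a if ts_a else path_b


tmp = Path(tempfile.mkdtemp())
project = tmp / "project" / "target" / "partial_parse.msgpack"
cache = tmp / "cache" / "target" / "partial_parse.msgpack"
project.parent.mkdir(parents=True)
cache.parent.mkdir(parents=True)
project.write_bytes(b"old")
cache.write_bytes(b"new")
os.utime(project, (1, 1))  # force the project copy to look much older

print(pick_latest(project, cache) == cache)                # True: the cache copy is newer
print(pick_latest(tmp / "missing", cache) == cache)        # True: a missing file never wins
print(pick_latest(tmp / "missing", tmp / "also_missing"))  # None
```

This is what fixes limitation 2 from the description: even if the `partial_parse.msgpack` shipped with the project goes stale, a fresher cached copy takes precedence.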
```python
def _update_partial_parse_cache(latest_partial_parse_filepath: Path, cache_dir: Path) -> None:
    """
    Update the cache to have the latest partial parse file contents.

    :param latest_partial_parse_filepath: Path to the most up-to-date partial parse file
    :param cache_dir: Path to the Cosmos project cache directory
    """
    cache_path = get_partial_parse_path(cache_dir)
    manifest_path = get_partial_parse_path(cache_dir).parent / DBT_MANIFEST_FILE_NAME
    latest_manifest_filepath = latest_partial_parse_filepath.parent / DBT_MANIFEST_FILE_NAME

    shutil.copy(str(latest_partial_parse_filepath), str(cache_path))
    shutil.copy(str(latest_manifest_filepath), str(manifest_path))


def _copy_partial_parse_to_project(partial_parse_filepath: Path, project_path: Path) -> None:
    """
    Update the target dbt project directory to have the latest partial parse file contents.

    :param partial_parse_filepath: Path to the most up-to-date partial parse file
    :param project_path: Path to the target dbt project directory
    """
    target_partial_parse_file = get_partial_parse_path(project_path)
    tmp_target_dir = project_path / DBT_TARGET_DIR_NAME
    tmp_target_dir.mkdir(exist_ok=True)

    source_manifest_filepath = partial_parse_filepath.parent / DBT_MANIFEST_FILE_NAME
    target_manifest_filepath = target_partial_parse_file.parent / DBT_MANIFEST_FILE_NAME
    shutil.copy(str(partial_parse_filepath), str(target_partial_parse_file))
    shutil.copy(str(source_manifest_filepath), str(target_manifest_filepath))
```
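Taken together, the two copy helpers implement a simple round-trip: publish the freshest `partial_parse.msgpack` (plus `manifest.json`) into the cache, then seed each task's temporary project directory from it. A self-contained sketch of that flow, assuming dbt's `target/` layout (the directory names here are illustrative, not taken from the commit):

```python
import shutil
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# A freshly produced dbt target directory containing the two artifacts...
source_target = root / "project" / "target"
source_target.mkdir(parents=True)
(source_target / "partial_parse.msgpack").write_bytes(b"msgpack-bytes")
(source_target / "manifest.json").write_text("{}")

# ...is published into the per-DAG cache directory (cf. _update_partial_parse_cache)...
cache_target = root / "cache" / "my_dag" / "target"
cache_target.mkdir(parents=True)
for name in ("partial_parse.msgpack", "manifest.json"):
    shutil.copy(source_target / name, cache_target / name)

# ...and later seeded into the temporary project a task runs dbt from
# (cf. _copy_partial_parse_to_project).
tmp_project_target = root / "tmp_project" / "target"
tmp_project_target.mkdir(parents=True)
for name in ("partial_parse.msgpack", "manifest.json"):
    shutil.copy(cache_target / name, tmp_project_target / name)

print((tmp_project_target / "partial_parse.msgpack").read_bytes())  # b'msgpack-bytes'
```

Because the cache survives the temporary per-task directories, the next task run starts from the recently updated partial parse file instead of the stale copy shipped with the project.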