Cache profiles.yml file when using Cosmos ProfileMapping #925

Closed · 4 tasks
tatiana opened this issue Apr 29, 2024 · 2 comments · Fixed by #1046

Labels: area:performance, area:profile, dbt:list, epic-assigned, parsing:dbt_ls, profile:postgres
Milestone: 1.5.0

tatiana commented Apr 29, 2024

Context

In Cosmos 1.3, when users opt to use ProfileMapping, Cosmos creates a profiles.yml file in up to two places:

  • When using dbt ls to render the dbt project into an Airflow DAG
  • When running dbt commands

Once we implement #924, which makes sure the profile file is created in the same way, we can extend the caching mechanism introduced in #904 to also cache the profiles.yml file.

Depends on: #924

Acceptance criteria

  • Create a cache for the mocked (if used) and non-mocked profiles.yml per DAG/TaskGroup when using ProfileMapping
  • Define a criterion for deciding when to use the cached profiles.yml (perhaps a hash based on the keys in the connection and the profile args passed by the user? see the sketch below this list)
  • Reuse the existing profiles.yml whenever possible without recreating it.
  • The logs should contain fewer occurrences of the following:
Creating temporary profiles.yml at /var/folders/td/522y78v91d1f5wgh67mj3p0m0000gn/T/tmprpp66fdh/profiles.yml with the following contents:
default:
    outputs:
        dev:
            dbname: postgres
            host: 0.0.0.0
            password: '{{ env_var(''COSMOS_CONN_POSTGRES_PASSWORD'') }}'
            port: 5432
            schema: public
            type: postgres
            user: postgres
    target: dev
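
One way to meet the hashing criterion above would be something like the minimal sketch below, which derives a cache "version" from the Airflow connection fields and the user-supplied profile args. The function name and the choice of fields are illustrative assumptions, not Cosmos internals.

```
# Hypothetical sketch: derive a cache "version" from the Airflow connection and
# the profile args, so the cached profiles.yml is only reused while neither changed.
import hashlib
import json

from airflow.hooks.base import BaseHook


def profiles_yml_cache_version(conn_id: str, profile_args: dict) -> str:
    conn = BaseHook.get_connection(conn_id)
    # Only fields that end up in profiles.yml should invalidate the cache; the
    # password is referenced via env_var() in the rendered file, so it is omitted.
    relevant = {
        "host": conn.host,
        "port": conn.port,
        "schema": conn.schema,
        "login": conn.login,
        "extra": conn.extra,
        "profile_args": profile_args,
    }
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Example usage (conn id and args are placeholders):
# profiles_yml_cache_version("example_conn", {"schema": "public"})
```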
@tatiana tatiana added the area:performance and area:profile labels Apr 29, 2024
@dosubot dosubot bot added the dbt:list, parsing:dbt_ls and profile:postgres labels Apr 29, 2024
@tatiana tatiana added this to the 1.5.0 milestone Apr 30, 2024
tatiana added a commit that referenced this issue May 1, 2024
…ct (#904)

Improve the performance of running the benchmark DAG with 100 tasks by
34% and the benchmark DAG with 10 tasks by 22% by persisting the dbt
partial parse artifact in Airflow nodes. The gain can be even higher for
dbt projects that take longer to parse.

With the introduction of #800, Cosmos supports using dbt partial parsing
files. This feature has led to a substantial performance improvement,
particularly for large dbt projects, both during Airflow DAG parsing
(using LoadMode.DBT_LS) and during Airflow task execution (when using
`ExecutionMode.LOCAL` and `ExecutionMode.VIRTUALENV`).

There were two limitations with the initial support for partial parsing,
which the current PR aims to address:

1. DAGs using Cosmos `ProfileMapping` classes could not leverage this
feature. This is because the partial parsing relies on profile files not
changing, and by default, Cosmos would mock the dbt profile in several
parts of the code. The consequence is that users trying Cosmos 1.4.0a1
will see the following message:
```
13:33:16  Unable to do partial parsing because profile has changed
13:33:16  Unable to do partial parsing because env vars used in profiles.yml have changed
```

2. The user had to explicitly provide a `partial_parse.msgpack` file in
the original project folder for their Airflow deployment - and if, for
any reason, this became outdated, the user would not benefit from the
partial parsing feature. Since Cosmos runs dbt tasks from within a
temporary directory, the partial parse file would become stale for some
users: it would be updated in the temporary directory, but the next time
the task ran, Cosmos/dbt would not leverage the recently updated
`partial_parse.msgpack` file.

The current PR addresses these two issues, respectively, by:

1. Allowing users that want to leverage Cosmos `ProfileMapping` and
partial parsing to use `RenderConfig(enable_mock_profile=False)`

2. Introducing a Cosmos cache directory where we are persisting partial
parsing files. This feature is enabled by default, but users can opt out
by setting the Airflow configuration `[cosmos][enable_cache] = False`
(exporting the environment variable `AIRFLOW__COSMOS__ENABLE_CACHE=0`).
Users can also define the temporary directory used to store these files
using the `[cosmos][cache_dir]` Airflow configuration. By default,
Cosmos will create and use a folder `cosmos` inside the system's
temporary directory:
https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir (see the sketch below).
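
As a rough illustration of the settings described in item 2 (not Cosmos's actual settings module), the cache directory could be resolved like this, falling back to a `cosmos` folder under the system temporary directory:

```
# Illustrative only: resolve the cache settings described above using Airflow's
# config parser; Cosmos's real settings module may differ.
import tempfile
from pathlib import Path

from airflow.configuration import conf

# [cosmos] enable_cache = False (or AIRFLOW__COSMOS__ENABLE_CACHE=0) opts out.
enable_cache = conf.getboolean("cosmos", "enable_cache", fallback=True)

# [cosmos] cache_dir overrides the default location under the system tmp dir.
default_cache_dir = Path(tempfile.gettempdir()) / "cosmos"
cache_dir = Path(conf.get("cosmos", "cache_dir", fallback=str(default_cache_dir)))

if enable_cache:
    # Create the folder up front so later steps can write partial parse files.
    cache_dir.mkdir(parents=True, exist_ok=True)
```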

This PR affects both DAG parsing and task execution. Although it does
not introduce an optimisation per se, it makes the partial parse feature
implemented in #800 available to more users.

Closes: #722

I updated the documentation in the PR: #898

Some future optimization steps related to caching, to be addressed in
separate PRs:
i. Change how we create mocked profiles so the file itself is created in
the same way, referencing an environment variable with the same name,
and only the value of the environment variable changes (#924)
ii. Extend caching to the `profiles.yml` created by Cosmos in the newly
introduced `tmp/cosmos` without the need to recreate it every time
(#925).
iii. Extend caching to the Airflow DAG/Task group as a pickle file -
this approach is more generic and would work for every type of DAG
parsing and executor. (#926)
iv. Support persisting/fetching the cache from remote storage so we
don't have to replicate it for every Airflow scheduler and worker node.
(#927)
v. Cache the dbt deps lock file / avoid installing dbt deps every time.
We can leverage `package-lock.yml`, introduced in dbt 1.7
(https://docs.getdbt.com/reference/commands/deps#predictable-package-installs),
but ideally, we'd have a strategy to support older versions of dbt as
well. (#930)
vi. Support caching `partial_parse.msgpack` even when vars change:
https://medium.com/@sebastian.daum89/how-to-speed-up-single-dbt-invocations-when-using-changing-dbt-variables-b9d91ce3fb0d
vii. Support partial parsing in Docker and Kubernetes Cosmos executors
(#929)
viii. Centralise all the Airflow-based config into Cosmos settings.py &
create a dedicated docs page containing information about these settings (#928)

**How to validate this change**

Run the performance benchmark against this and the `main` branch,
checking the value of `/tmp/performance_results.txt`.

Example of commands run locally:

```
# Setup
AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:[email protected]:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance-setup

# Run test for 100 dbt models per DAG:
MODEL_COUNT=100 AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:[email protected]:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance
```

An example of the output when running 100 models with the main branch:
```
NUM_MODELS=100
TIME=114.18614888191223
MODELS_PER_SECOND=0.8757629623135543
DBT_VERSION=1.7.13
```

And with the current PR:
```
NUM_MODELS=100
TIME=75.17766404151917
MODELS_PER_SECOND=1.33018232576064
DBT_VERSION=1.7.13
```
@tatiana tatiana added and removed the triage-needed label May 17, 2024

tatiana commented Jun 12, 2024

A key point to remember in this ticket is that we'll likely need two caches per ProfileMapping instance (see the sketch below):

  • one when it's mocked
  • another when it's not mocked
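
A minimal sketch of how the two variants could be kept apart on disk; the layout and names are assumptions, not the final design:

```
# Hypothetical layout: mocked and non-mocked profiles.yml for the same
# ProfileMapping live under separate cache entries, since their contents differ.
from pathlib import Path


def cached_profiles_path(cache_dir: Path, version: str, mocked: bool) -> Path:
    variant = "mocked" if mocked else "real"
    return cache_dir / "profile" / f"{version}-{variant}" / "profiles.yml"
```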

tatiana commented Jun 13, 2024

Hi @pankajastro, following what we discussed, these PRs may help out:

Do not worry about Airflow being distributed / remote storage as part of this task. We'll address this in #927

Some suggestions to break this down:

  • identify what makes a good cache "version" based on the profile config (a sketch follows at the end of this comment)
  • decide whether this should be specific per DAG/TaskGroup or not, and what the cache "identifier" should be
  • make the approach work for mocked profiles
  • lastly, make it work for non-mocked profiles

As you noticed, this is only needed if people are using ProfileMapping classes.
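
A sketch of one possible cache "identifier", scoping entries per DAG/TaskGroup and pairing them with the profile "version" hash; all names here are illustrative assumptions, not Cosmos internals:

```
# Hypothetical: combine the DAG/TaskGroup scope with the profile "version" so a
# profile change only invalidates the entries that actually used it.
from pathlib import Path
from typing import Optional


def profile_cache_identifier(dag_id: str, task_group_id: Optional[str], version: str) -> str:
    scope = f"{dag_id}__{task_group_id}" if task_group_id else dag_id
    return f"{scope}__{version}"


def cached_profiles_yml(cache_dir: Path, identifier: str) -> Path:
    return cache_dir / "profile" / identifier / "profiles.yml"
```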

arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this issue Jul 14, 2024
…ct (astronomer#904)

arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this issue Jul 14, 2024
Add dbt profile caching mechanism.

1. Introduced the env `enable_cache_profile` to enable or disable profile
caching. This takes effect only if the global `enable_cache` is enabled.
2. Users can set the env `profile_cache_dir_name`. This will be the name
of a sub-directory inside `cache_dir` where cached profiles are stored.
It is optional, and the default name is `profile`.
3. Example path for a versioned profile:
`{cache_dir}/{profile_cache_dir}/592906f650558ce1dadb75fcce84a2ec09e444441e6af6069f19204d59fe428b/profiles.yml`
4. Implemented profile mapping hashing: first, the profile is serialized
using pickle; then, the profile_name and target_name are appended before
hashing the data using the SHA-256 algorithm (see the sketch below).
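
A minimal sketch of the hashing scheme described in item 4 and the versioned path from item 3; attribute and parameter names are illustrative, not the exact Cosmos internals:

```
# Sketch only: serialize the profile mapping with pickle, append profile_name and
# target_name, and hash with SHA-256 to get the versioned cache directory name.
import hashlib
import pickle
from pathlib import Path


def profile_version(profile_mapping, profile_name: str, target_name: str) -> str:
    data = pickle.dumps(profile_mapping)
    data += profile_name.encode("utf-8") + target_name.encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def versioned_profiles_path(cache_dir: Path, profile_cache_dir: str, version: str) -> Path:
    # e.g. {cache_dir}/{profile_cache_dir}/<sha256>/profiles.yml
    return cache_dir / profile_cache_dir / version / "profiles.yml"
```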

**Perf test result:**
In a local dev env with the command
```
AIRFLOW_HOME=`pwd` AIRFLOW_CONN_EXAMPLE_CONN="postgres://postgres:[email protected]:5432/postgres"  AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.10-2.8:test-performance
```

NUM_MODELS=100
- TIME=167.45248413085938 (with profile cache enabled)
- TIME=173.94845390319824 (with profile cache disabled)

NUM_MODELS=200
- TIME=376.2585120201111 (with profile cache enabled)
- TIME=418.14210200309753 (with profile cache disabled)

Closes: astronomer#925
Closes: astronomer#647