Support caching remotely #927

tatiana · 2024-04-29T10:51:47Z

Context

In #904, Cosmos introduced caching, which contributed to the latest performance improvements in 1.4.

However, one of the limitations of this approach is that the cache is stored locally, on disk. This means that:

if the Airflow scheduler/worker node that's running Airflow is recreated, the cache will have to be recreated
each of the Airflow worker nodes/schedulers will have to create its own cache. This can be remarkably inefficient when using the Airflow KubernetesExecutor, where each task runs in its own pod
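
To make the limitation concrete, here is a minimal sketch of how a node-local cache location could be resolved. It is based only on the settings described in the #904 commit message quoted in this thread (`[cosmos][cache_dir]`, `[cosmos][enable_cache]`, and the default `cosmos` folder inside the system temporary directory). The helper names are hypothetical, and reading the environment directly is a simplification: Cosmos goes through Airflow's configuration system, so this is not its actual code.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the cache folder for this node (hypothetical helper)."""
    env = os.environ if env is None else env
    configured = env.get("AIRFLOW__COSMOS__CACHE_DIR")
    if configured:
        return Path(configured)
    # Default described in #904: a "cosmos" folder in the system temp dir.
    return Path(tempfile.gettempdir()) / "cosmos"


def cache_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Simplified boolean parsing (assumption); caching defaults to enabled."""
    env = os.environ if env is None else env
    raw = env.get("AIRFLOW__COSMOS__ENABLE_CACHE", "true")
    return raw.strip().lower() in {"1", "true", "t"}
```

Because the resolved path lives on the node's own disk, every scheduler, worker, or pod that starts with an empty disk begins with an empty cache.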

During the code review of the PR mentioned above, one piece of feedback was that it would be great if we supported caching this in S3/GCS/Blob storage: #904 (comment) (from @jlaneve).

Another suggestion was to leverage the Airflow 2.8 ObjectStore: #904 (comment) and/or to use an XCom backend to store the cache (from @kaxil).

Acceptance Criteria

Decide on an approach to store the remote cache
Allow users to update/fetch the cache from a remote location for all Airflow versions supported by Cosmos


…ct (#904)

Improve the performance to run the benchmark DAG with 100 tasks by 34% and the benchmark DAG with 10 tasks by 22%, by persisting the dbt partial parse artifact in Airflow nodes. This performance can be even higher in the case of dbt projects that take more time to be parsed.

With the introduction of #800, Cosmos supports using dbt partial parsing files. This feature has led to a substantial performance improvement, particularly for large dbt projects, both during Airflow DAG parsing (using LoadMode.DBT_LS) and also Airflow task execution (when using `ExecutionMode.LOCAL` and `ExecutionMode.VIRTUALENV`).

There were two limitations with the initial support to partial parsing, which the current PR aims to address:

1. DAGs using Cosmos `ProfileMapping` classes could not leverage this feature. This is because the partial parsing relies on profile files not changing, and by default, Cosmos would mock the dbt profile in several parts of the code. The consequence is that users trying Cosmos 1.4.0a1 will see the following message:

```
13:33:16 Unable to do partial parsing because profile has changed
13:33:16 Unable to do partial parsing because env vars used in profiles.yml have changed
```

2. The user had to explicitly provide a `partial_parse.msgpack` file in the original project folder for their Airflow deployment - and if, for any reason, this became outdated, the user would not leverage the partial parsing feature. Since Cosmos runs dbt tasks from within a temporary directory, the partial parse would be stale for some users: it would be updated in the temporary directory, but the next time the task was run, Cosmos/dbt would not leverage the recently updated `partial_parse.msgpack` file.

The current PR addresses these two issues respectively by:

1. Allowing users that want to leverage Cosmos `ProfileMapping` and partial parsing to use `RenderConfig(enable_mock_profile=False)`
2. Introducing a Cosmos cache directory where we are persisting partial parsing files. This feature is enabled by default, but users can opt out by setting the Airflow configuration `[cosmos][enable_cache] = False` (exporting the environment variable `AIRFLOW__COSMOS__ENABLE_CACHE=0`). Users can also define the temporary directory used to store these files using the `[cosmos][cache_dir]` Airflow configuration. By default, Cosmos will create and use a folder `cosmos` inside the system's temporary directory: https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir .

This PR affects both DAG parsing and task execution. Although it does not introduce an optimisation per se, it makes the partial parse feature implemented in #800 available to more users.

Closes: #722

I updated the documentation in the PR: #898

Some future steps related to optimization associated to caching to be addressed in separate PRs:

i. Change how we create mocked profiles, to create the file itself in the same way, referencing an environment variable with the same name - and only changing the value of the environment variable (#924)
ii. Extend caching to the `profiles.yml` created by Cosmos in the newly introduced `tmp/cosmos` without the need to recreate it every time (#925).
iii. Extend caching to the Airflow DAG/Task group as a pickle file - this approach is more generic and would work for every type of DAG parsing and executor. (#926)
iv. Support persisting/fetching the cache from remote storage so we don't have to replicate it for every Airflow scheduler and worker node. (#927)
v. Cache dbt deps lock file/avoid installing dbt steps every time. We can leverage `package-lock.yml` introduced in dbt 1.7 (https://docs.getdbt.com/reference/commands/deps#predictable-package-installs), but ideally, we'd have a strategy to support older versions of dbt as well. (#930)
vi. Support caching `partial_parse.msgpack` even when vars change: https://medium.com/@sebastian.daum89/how-to-speed-up-single-dbt-invocations-when-using-changing-dbt-variables-b9d91ce3fb0d
vii. Support partial parsing in Docker and Kubernetes Cosmos executors (#929)
viii. Centralise all the Airflow-based config into Cosmos settings.py & create a dedicated docs page containing information about these (#928)

**How to validate this change**

Run the performance benchmark against this and the `main` branch, checking the value of `/tmp/performance_results.txt`.

Example of commands run locally:

```
# Setup
AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:[email protected]:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance-setup

# Run test for 100 dbt models per DAG:
MODEL_COUNT=100 AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:[email protected]:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance
```

An example of output when running 100 with the main branch:

```
NUM_MODELS=100
TIME=114.18614888191223
MODELS_PER_SECOND=0.8757629623135543
DBT_VERSION=1.7.13
```

And with the current PR:

```
NUM_MODELS=100
TIME=75.17766404151917
MODELS_PER_SECOND=1.33018232576064
DBT_VERSION=1.7.13
```
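
As a quick check of the 34% figure against the benchmark output quoted above, the arithmetic can be reproduced from the two `TIME` values. The function name below is made up for illustration; the numbers are copied from the output.

```python
def summarize(num_models: int, seconds_before: float, seconds_after: float) -> dict:
    """Derive throughput and relative time saved from two benchmark runs."""
    return {
        "models_per_second_before": num_models / seconds_before,
        "models_per_second_after": num_models / seconds_after,
        "time_reduction_pct": 100 * (seconds_before - seconds_after) / seconds_before,
    }


# TIME values reported for the main branch and for the PR, with 100 models.
result = summarize(100, 114.18614888191223, 75.17766404151917)
print(f"{result['time_reduction_pct']:.1f}% less wall-clock time")  # roughly 34%
```

The two throughput values computed this way match the `MODELS_PER_SECOND` lines in the output, which suggests that field is simply `NUM_MODELS / TIME`.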

dwreeves · 2024-05-17T23:39:26Z

This is sort of a duplicate of #870, although I prefer we use this issue as yours is newer and more general. (E.g. I don't mention the use of xcoms as the cache.) Just tagging that issue to relate these discussions.

dwreeves · 2024-05-27T17:10:11Z

Remote filesystem stuff keeps coming up in multiple contexts. And we already have support for this in, of all places, cosmos/plugin/__init__.py with the open_file() function.

I think this function should be moved to some sort of utils file for interacting with remote filesystems, and the logic for getting the conn_id from the cosmos config should be decoupled from open_file() so it can be used more generically.
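
A rough sketch of what that decoupling could look like, to make the idea concrete. This does not reproduce the existing `open_file()` in Cosmos; the function name, the reader registry, and the `conn_id` parameter are all hypothetical, and only a local-disk reader is implemented so the example stays self-contained.

```python
from pathlib import Path
from typing import Callable, Dict, Optional
from urllib.parse import urlsplit

# A reader takes (path, conn_id) and returns the file content as text.
Reader = Callable[[str, Optional[str]], str]


def read_local(path: str, conn_id: Optional[str] = None) -> str:
    """Read from local disk; conn_id is accepted for symmetry but unused."""
    return Path(path).read_text()


# Paths without a URL scheme are treated as local files.
DEFAULT_READERS: Dict[str, Reader] = {"": read_local}


def read_text_from(
    path: str,
    conn_id: Optional[str] = None,
    readers: Optional[Dict[str, Reader]] = None,
) -> str:
    """Dispatch on the URL scheme; the caller supplies conn_id explicitly
    instead of the function looking it up in the Cosmos config."""
    registry = {**DEFAULT_READERS, **(readers or {})}
    scheme = urlsplit(path).scheme
    if scheme not in registry:
        raise ValueError(f"No reader registered for scheme {scheme!r}")
    return registry[scheme](path, conn_id)
```

With this shape, the dbt docs plugin could keep resolving its own `conn_id` from the Cosmos config and pass it in, while a cache module could pass a different one.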

That said, we also want to make sure we are doing things idiomatically, as well. For Airflow 2.8+, ObjectStoragePath was essentially designed to do this. I'd like it if Cosmos felt like Airflow, and used things that are standard in Airflow.

For supporting older versions of Airflow, we can create some sort of compatibility thing:

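# cosmos/compat/__init__.py
try:
    # Airflow 2.8+ ships ObjectStoragePath.
    from airflow.io.path import ObjectStoragePath
except ImportError:
    # Older Airflow: fall back to a Cosmos-provided backport (proposed here).
    from cosmos.compat._object_storage_path import ObjectStoragePath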

where _object_storage_path.py contains an Airflow 2.4+ compliant implementation of ObjectStoragePath.

- Support `static_index.html` for dbt docs.
- Refactor remote filesystem access functions in anticipation of moving them out of `cosmos/plugins/__init__.py`. Refactoring is designed to make them behave a little more predictably and to make them look a little more like Airflow 2.8+'s `ObjectStoragePath` class. Of course, this is far, far from complete.

# Related Issue(s)

- Main: #986
- Related: #927


pankajkoti · 2024-08-12T10:16:05Z

We decided to use the Airflow Object Storage feature, which has been available since Airflow 2.8.0.

Since the approach is decided, I will create sub-tasks for caching remotely for each of the local caches (package-lock, partial parse, and profile cache) and then close this ticket.
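
For illustration only, here is a minimal sketch of pushing and pulling a cache folder through a pathlib-style interface. It is not the implementation that landed in Cosmos; `sync_dir` is a hypothetical name. The example uses local `pathlib.Path` objects so it runs anywhere. Since `ObjectStoragePath` in Airflow 2.8+ is designed to offer a pathlib-like API, something like `ObjectStoragePath("s3://bucket/cosmos-cache", conn_id="aws_default")` might be passed as the remote side instead, but that has not been verified here.

```python
from pathlib import Path


def sync_dir(src: Path, dst: Path) -> int:
    """Copy every file under src into dst, preserving relative paths.

    Returns the number of files copied. Works for both directions:
    local -> remote (push) and remote -> local (pull).
    """
    copied = 0
    for item in sorted(src.rglob("*")):
        if not item.is_file():
            continue
        target = dst / item.relative_to(src).as_posix()
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(item.read_bytes())
        copied += 1
    return copied


# Usage sketch (paths are placeholders):
#   sync_dir(local_cache_dir, remote_cache_dir)   # push after the cache is built
#   sync_dir(remote_cache_dir, local_cache_dir)   # pull on a fresh node
```

A real implementation would also need to decide when a remote copy is stale and how to handle concurrent writers, which this sketch ignores.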

tatiana · 2024-08-16T10:58:29Z

We started doing this in #1147, but we still need to extend this feature to support other caches (partial parsing + manifest, profile, dbt_packages.lock). @pankajkoti will be logging sub tasks so we can address this over time.

pankajkoti · 2024-08-27T10:24:39Z

We have achieved the acceptance criteria for this ticket.
For follow-up work, I have created the below tickets:
#1177
#1178
#1179

I am hence closing this ticket.

tatiana added the labels area:performance (related to performance, like memory usage, CPU usage, speed, etc), area:rendering (related to rendering, like Jinja, Airflow tasks, etc) and area:execution (related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc) Apr 29, 2024

tatiana mentioned this issue Apr 29, 2024

Improve performance by 22-35% or more by caching partial parse artefact #904

Merged

dosubot bot added the execution:kubernetes label (related to the Kubernetes execution environment) Apr 29, 2024

tatiana added this to the 1.6.0 milestone Apr 30, 2024

tatiana mentioned this issue May 21, 2024

WIP: Add support for DAG & TaskGroup level caching (performance improvement) #992

Closed

dwreeves mentioned this issue May 27, 2024

support static_index.html docs #999

Merged

2 tasks

This was referenced Jun 6, 2024

Cache TaskGroup/DAG regardless of the load_method #926

Closed

Establish general pattern for uploading artifacts to storage #894

Open

This was referenced Jun 11, 2024

Fix Cosmos enable_cache setting #1025

Merged

[Bug] Cosmos stale temporary directories #958

Open

[Bug] Caching filling up Airflow nodes disk #1042

Closed

Cache profiles.yml file when using Cosmos ProfileMapping #925

Closed

tatiana added the priority:high label (high priority issues are blocking or critical issues without a workaround and large impact) Jun 24, 2024

tatiana mentioned this issue Jun 27, 2024

[Feature] Allow storing dbt ls cache into Object Store #1072

Closed

1 task

tatiana assigned pankajkoti Jul 1, 2024

phanikumv mentioned this issue Jul 18, 2024

Release Cosmos 1.6.0 #1103

Closed

18 tasks

tatiana modified the milestones: Cosmos 1.6.0, Cosmos 1.7.0 Aug 16, 2024

pankajkoti closed this as completed Aug 27, 2024
