
Support caching virtualenvs created when using ExecutionMode.VIRTUALENV #610

Closed · Tracked by #1103 · Fixed by #1079
tatiana opened this issue Oct 18, 2023 · 7 comments
Labels: area:performance (Related to performance, like memory usage, CPU usage, speed, etc.) · good first issue (Good for newcomers)
Milestone: Cosmos 1.6.0

Comments
tatiana (Collaborator) commented Oct 18, 2023

As of Cosmos 1.2, when using ExecutionMode.VIRTUALENV, each task creates a new Python virtual environment in a temporary directory:

self._venv_tmp_dir = TemporaryDirectory(prefix="cosmos-venv")

This can cause delays, as discussed in the Slack thread:
https://apache-airflow.slack.com/archives/C059CC42E9W/p1697614400289939

This could be improved, assuming the Airflow worker node is reused to run multiple tasks.

Proposal
Persist the virtual environment directory if users set the ExecutionConfig.virtualenv_path configuration.
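
For illustration only, this is roughly how the proposed option could look from a DAG author's perspective; ExecutionConfig and ExecutionMode exist today, but virtualenv_path is the option being proposed and is not part of Cosmos 1.2:

```python
from pathlib import Path

from cosmos import ExecutionConfig, ExecutionMode

# Hypothetical usage of the proposed option: when `virtualenv_path` is set,
# Cosmos would create the virtualenv at that path once and reuse it across
# tasks, instead of creating a fresh TemporaryDirectory for every task.
execution_config = ExecutionConfig(
    execution_mode=ExecutionMode.VIRTUALENV,
    virtualenv_path=Path("/opt/airflow/cosmos-venv"),  # proposed option, does not exist in Cosmos 1.2
)
```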

tatiana added the enhancement (New feature or request) and good first issue (Good for newcomers) labels on Oct 18, 2023
LennartKloppenburg (Contributor) commented:

As discussed with @tatiana, I'd be happy to take a first stab at this :)

tatiana (Collaborator, Author) commented Oct 18, 2023

Thank you very much for taking the lead on this, @LennartKloppenburg! Please let us know if you'd like any support.

LennartKloppenburg (Contributor) commented:

I'm wondering whether it'd make sense to add a clean-up callback, executed after the whole DAG has completed, to delete the virtual env.
Or should we defer this responsibility/flexibility to the end user?

LennartKloppenburg (Contributor) commented:

More things to consider:

  1. It could be useful to also centralize and cache the installation of dbt deps, as this is currently still happening in every task with this PR. Implementing this too would invalidate the install_deps flag at the operator level (on DbtLocalBaseOperator, for instance) and make more sense on the ExecutionConfig instead, at least when using VirtualEnv as the ExecutionMode.

  2. If we let an operator create a virtual env and cache it, it can cause concurrency issues with other operators simultaneously doing the same thing. My changes work when starting with one task in isolation and then running the rest...
    Perhaps the most elegant solution would be to add a "set-up" operator at the start of a VirtualEnv-based DAG that sets up the env, installs Python requirements, installs dbt deps and everything, and then another one that tears it down at the end (see the sketch below). That would take care of a lot of the complexity we'd otherwise end up implementing at the operator level.

The current PR would also split the operator's responsibility into "do whatever dbt thing it needs to do" and "optionally set up a virtual env if necessary", and that feels like an anti-pattern.
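
To make the set-up/tear-down idea in point 2 concrete, here is a rough sketch using Airflow 2.7's setup/teardown decorators; the DAG name, VENV_PATH, and task bodies are hypothetical placeholders, and (as discussed in the next comment) this shape only helps when all tasks land on the same worker:

```python
from datetime import datetime

from airflow.decorators import dag, setup, task, teardown

VENV_PATH = "/tmp/cosmos-shared-venv"  # hypothetical shared location on the worker node


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def dbt_virtualenv_dag():
    @setup
    def create_venv():
        # Hypothetical: build the virtualenv at VENV_PATH once, install the
        # Python requirements and run `dbt deps`, so downstream tasks can reuse it.
        ...

    @task
    def run_dbt_models():
        # Placeholder for the dbt/Cosmos tasks that would reuse VENV_PATH.
        ...

    @teardown
    def delete_venv():
        # Hypothetical: remove the shared virtualenv once the DAG run finishes.
        ...

    create_venv() >> run_dbt_models() >> delete_venv()


dbt_virtualenv_dag()
```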

tatiana added the area:performance (Related to performance, like memory usage, CPU usage, speed, etc.) label and removed the enhancement (New feature or request) label on Oct 19, 2023
tatiana (Collaborator, Author) commented Oct 26, 2023

@LennartKloppenburg, we discussed this last week, and I'll add some thoughts here as well:

The challenge with having the setup in a separate operator is that, due to the distributed nature of Airflow, there is no guarantee this operator would run on the same node as the other dbt/Cosmos tasks.
And if users run Airflow with the Kubernetes executor, a new pod is created each time a worker picks up a task, so the setup needs to happen at the operator level.

If the setup was in some remote service, we could try to leverage the Airflow 2.7 DAG-level setup/teardown:
https://airflow.apache.org/docs/apache-airflow/stable/howto/setup-and-teardown.html

If Airflow had worker-node-level setup/teardown, that would be optimal for our use case. Still, if different tasks had different dependencies, we might have concurrency challenges when using asynchronous operators (there has been discussion about Cosmos supporting dbt Cloud, for instance, and in that case we'd be leveraging Airflow deferrable operators).

For this reason, I believe we'll need to come up with a solution to the concurrency issue at the task level, for now. More thoughts on the PR itself.
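
To illustrate what a task-level answer to the concurrency problem could look like (this is just a sketch, not necessarily what the linked PR implements), the shared virtualenv creation could be guarded by a lock file, so concurrent tasks on the same worker either create the environment once or wait for it and then reuse it. A minimal standard-library sketch, assuming a fixed virtualenv_path as proposed above and a hypothetical lock-file name:

```python
from __future__ import annotations

import fcntl
import subprocess
from pathlib import Path
from venv import EnvBuilder


def get_or_create_venv(virtualenv_path: Path, py_requirements: list[str]) -> Path:
    """Create the shared virtualenv once per worker; concurrent callers block on a lock file."""
    virtualenv_path.mkdir(parents=True, exist_ok=True)
    lock_path = virtualenv_path / ".cosmos.lock"  # hypothetical lock-file name
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # serialize creation across tasks on this worker
        try:
            python_bin = virtualenv_path / "bin" / "python"
            if not python_bin.exists():
                # First task to grab the lock builds the environment and installs requirements.
                EnvBuilder(with_pip=True).create(str(virtualenv_path))
                subprocess.run(
                    [str(python_bin), "-m", "pip", "install", *py_requirements],
                    check=True,
                )
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return python_bin
```

Each dbt task would then call something like get_or_create_venv() before invoking dbt with the returned interpreter, keeping the setup at the task level while avoiding the race described above.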

LennartKloppenburg (Contributor) commented:

Yep, you're totally right! Sorry for duplicating the question here, but your info is excellent as a paper trail, so let's leave it :D

dosubot (bot) commented Jan 30, 2024

Hi, @tatiana,

I'm helping the Cosmos team manage their backlog and am marking this issue as stale. From what I understand, the issue proposes adding support for caching virtual environments created with ExecutionMode.VIRTUALENV in Cosmos to improve performance. There's a volunteer considering adding a clean-up callback and centralizing and caching the installation of dbt deps, and there's a discussion about potential concurrency issues and the need to address them at a task level due to the distributed nature of Airflow. Tatiana has provided insights on the challenges of having setup in a separate operator and the need to consider concurrency issues at a task level.

Could you please confirm if this issue is still relevant to the latest version of the Cosmos repository? If it is, please let the Cosmos team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you!

dosubot added the stale (Issue has not had recent activity or appears to be solved) label on Jan 30, 2024
dosubot closed this as not planned (Won't fix, can't repro, duplicate, stale) on Feb 6, 2024
dosubot removed the stale label on Feb 6, 2024
tatiana reopened this on Jul 11, 2024
tatiana added this to the Cosmos 1.6.0 milestone on Jul 11, 2024
tatiana mentioned this issue (18 tasks) on Aug 16, 2024