
Support caching virtualenvs created when using ExecutionMode.VIRTUALENV #610

Closed · Tracked by #1103 · Fixed by #1079
tatiana opened this issue Oct 18, 2023 · 7 comments
Labels: area:performance (Related to performance, like memory usage, CPU usage, speed, etc.) · good first issue (Good for newcomers)
Milestone: Cosmos 1.6.0

Comments
tatiana (Collaborator) commented Oct 18, 2023

As of Cosmos 1.2, when using ExecutionMode.VIRTUALENV, each task creates a new Python virtual environment in a temporary directory:

self._venv_tmp_dir = TemporaryDirectory(prefix="cosmos-venv")

This can cause delays, as discussed in the Slack thread:
https://apache-airflow.slack.com/archives/C059CC42E9W/p1697614400289939

This could be improved, assuming the Airflow worker node is reused to run multiple tasks.

Proposal
Persist the virtual environment directory if users set the ExecutionConfig.virtualenv_path configuration.
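
For illustration only, this is roughly how the proposed option could look from a DAG author's perspective; ExecutionConfig and ExecutionMode exist today, but virtualenv_path is the option being proposed and is not part of Cosmos 1.2:

```python
from pathlib import Path

from cosmos import ExecutionConfig, ExecutionMode

# Hypothetical usage of the proposed option: when `virtualenv_path` is set,
# Cosmos would create the virtualenv at that path once and reuse it across
# tasks, instead of creating a fresh TemporaryDirectory for every task.
execution_config = ExecutionConfig(
    execution_mode=ExecutionMode.VIRTUALENV,
    virtualenv_path=Path("/opt/airflow/cosmos-venv"),  # proposed option, does not exist in Cosmos 1.2
)
```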

tatiana added the enhancement (New feature or request) and good first issue (Good for newcomers) labels on Oct 18, 2023
LennartKloppenburg (Contributor) commented:

As discussed with @tatiana, I'd be happy to take a first stab at this :)

tatiana (Collaborator, Author) commented Oct 18, 2023

Thank you very much for taking the lead on this, @LennartKloppenburg! Please let us know if you'd like any support.

LennartKloppenburg (Contributor) commented:

I'm wondering whether it'd make sense to add a clean-up callback, executed after the whole DAG has completed, to delete the virtual env.
Or should we defer this responsibility/flexibility to the end user?

LennartKloppenburg (Contributor) commented:

More things to consider:

  1. It could be useful to also centralize and cache the installation of dbt deps, as this is currently still happening in every task with this PR. Implementing this too would invalidate the install_deps flag at the operator level (on DbtLocalBaseOperator, for instance) and make more sense on the ExecutionConfig instead, at least when using VirtualEnv as the ExecutionMode.

  2. If we let an operator create a virtual env and cache it, it can cause concurrency issues with other operators simultaneously doing the same thing. My changes work when starting with one task in isolation and then running the rest...
    Perhaps the most elegant solution would be to add a "set-up" operator at the start of a VirtualEnv-based DAG that sets up the env, installs Python requirements, installs dbt deps and everything, and then another one that tears it down at the end (see the sketch below). That would take care of a lot of the complexity we'd otherwise end up implementing at the operator level.

The current PR would also split the operator's responsibility into "do whatever dbt thing it needs to do" and "optionally set up a virtual env if necessary", and that feels like an anti-pattern.
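
To make the set-up/tear-down idea in point 2 concrete, here is a rough sketch using Airflow 2.7's setup/teardown decorators; the DAG name, VENV_PATH, and task bodies are hypothetical placeholders, and (as discussed in the next comment) this shape only helps when all tasks land on the same worker:

```python
from datetime import datetime

from airflow.decorators import dag, setup, task, teardown

VENV_PATH = "/tmp/cosmos-shared-venv"  # hypothetical shared location on the worker node


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def dbt_virtualenv_dag():
    @setup
    def create_venv():
        # Hypothetical: build the virtualenv at VENV_PATH once, install the
        # Python requirements and run `dbt deps`, so downstream tasks can reuse it.
        ...

    @task
    def run_dbt_models():
        # Placeholder for the dbt/Cosmos tasks that would reuse VENV_PATH.
        ...

    @teardown
    def delete_venv():
        # Hypothetical: remove the shared virtualenv once the DAG run finishes.
        ...

    create_venv() >> run_dbt_models() >> delete_venv()


dbt_virtualenv_dag()
```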

tatiana added the area:performance (Related to performance, like memory usage, CPU usage, speed, etc.) label and removed the enhancement (New feature or request) label on Oct 19, 2023
tatiana (Collaborator, Author) commented Oct 26, 2023

@LennartKloppenburg, we discussed this last week, and I'll add some thoughts here as well:

The challenge with having the setup in a separate operator is that, due to the distributed nature of Airflow, there is no guarantee this operator would run on the same node as the other dbt/Cosmos tasks.
And if users run Airflow with the Kubernetes executor, a new pod is created each time a worker picks up a task, so the setup needs to happen at the operator level.

If the setup was in some remote service, we could try to leverage the Airflow 2.7 DAG-level setup/teardown:
https://airflow.apache.org/docs/apache-airflow/stable/howto/setup-and-teardown.html

If Airflow had worker-node-level setup/teardown, that would be optimal for our use case. Still, if different tasks had different dependencies, we might have concurrency challenges when using asynchronous operators (there has been discussion about Cosmos supporting dbt Cloud, for instance, and in that case we'd be leveraging Airflow deferrable operators).

For this reason, I believe we'll need to come up with a solution to the concurrency issue at the task level, for now. More thoughts on the PR itself.
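
To illustrate what a task-level answer to the concurrency problem could look like (this is just a sketch, not necessarily what the linked PR implements), the shared virtualenv creation could be guarded by a lock file, so concurrent tasks on the same worker either create the environment once or wait for it and then reuse it. A minimal standard-library sketch, assuming a fixed virtualenv_path as proposed above and a hypothetical lock-file name:

```python
from __future__ import annotations

import fcntl
import subprocess
from pathlib import Path
from venv import EnvBuilder


def get_or_create_venv(virtualenv_path: Path, py_requirements: list[str]) -> Path:
    """Create the shared virtualenv once per worker; concurrent callers block on a lock file."""
    virtualenv_path.mkdir(parents=True, exist_ok=True)
    lock_path = virtualenv_path / ".cosmos.lock"  # hypothetical lock-file name
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # serialize creation across tasks on this worker
        try:
            python_bin = virtualenv_path / "bin" / "python"
            if not python_bin.exists():
                # First task to grab the lock builds the environment and installs requirements.
                EnvBuilder(with_pip=True).create(str(virtualenv_path))
                subprocess.run(
                    [str(python_bin), "-m", "pip", "install", *py_requirements],
                    check=True,
                )
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return python_bin
```

Each dbt task would then call something like get_or_create_venv() before invoking dbt with the returned interpreter, keeping the setup at the task level while avoiding the race described above.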

LennartKloppenburg (Contributor) commented:

Yep, you're totally right! Sorry for duplicating the question here, but your info is excellent as a paper trail, so let's leave it :D

dosubot (bot) commented Jan 30, 2024

Hi, @tatiana,

I'm helping the Cosmos team manage their backlog and am marking this issue as stale. From what I understand, the issue proposes adding support for caching virtual environments created with ExecutionMode.VIRTUALENV in Cosmos to improve performance. There's a volunteer considering adding a clean-up callback and centralizing and caching the installation of dbt deps, and there's a discussion about potential concurrency issues and the need to address them at a task level due to the distributed nature of Airflow. Tatiana has provided insights on the challenges of having setup in a separate operator and the need to consider concurrency issues at a task level.

Could you please confirm if this issue is still relevant to the latest version of the Cosmos repository? If it is, please let the Cosmos team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you!

dosubot added the stale (Issue has not had recent activity or appears to be solved) label on Jan 30, 2024
dosubot closed this as not planned (Won't fix, can't repro, duplicate, stale) on Feb 6, 2024
dosubot removed the stale label on Feb 6, 2024
tatiana reopened this on Jul 11, 2024
tatiana added this to the Cosmos 1.6.0 milestone on Jul 11, 2024
tatiana mentioned this issue (18 tasks) on Aug 16, 2024