Support caching virtualenvs created when using ExecutionMode.VIRTUALENV #610
Comments
As discussed with @tatiana, I'd be happy to take a first stab at this :)
Thank you very much for taking the lead on this, @LennartKloppenburg! Please let us know if you'd like any support.
I'm wondering whether it would make sense to add a clean-up callback, executed after the whole DAG has completed, that deletes the virtual env.
More things to consider:
The current PR would also split the operator's responsibility into "do whatever dbt thing it needs to do" and "optionally set up a virtual env if necessary", and that feels like an anti-pattern.
@LennartKloppenburg, we discussed this last week, and I'll add some thoughts here as well. The challenge with having the setup in a separate operator is that, due to the distributed nature of Airflow, there is no guarantee that this operator would run on the same nodes as the other dbt/Cosmos tasks. If the setup were in some remote service, we could try to leverage the Airflow 2.7 DAG-level setup/teardown. If Airflow had a worker-node-level setup/teardown, it would be optimal for our use case. Still, if different tasks had different dependencies, we could run into concurrency challenges when using asynchronous operators (supporting dbt Cloud in Cosmos has been discussed, for instance, and in that case we'd be leveraging Airflow deferrable operators). For this reason, I believe we'll need to come up with a solution to the concurrency issue at the task level, for now. More thoughts on the PR itself.
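To make the task-level concurrency concern above concrete, here is a minimal sketch of one possible approach: a lock file created atomically next to the shared virtualenv, so that only one task at a time on a given worker node creates or modifies it. This is not how Cosmos implements it; the name `directory_lock` and its parameters are hypothetical and used for illustration only.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def directory_lock(lock_path: str, timeout: float = 60.0, poll: float = 0.1):
    """Hypothetical helper: hold an exclusive lock file while the body runs."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # process on this node can create the lock file at a time.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire lock {lock_path}")
            time.sleep(poll)
    try:
        # Record the owner's PID to help with manual debugging.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A task would wrap the virtualenv setup in `with directory_lock(...)`. One known limitation of this sketch: if a worker process is killed while holding the lock, the stale lock file has to be cleaned up separately.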
Yep, you're totally right! Sorry for duplicating the question here, but your info is excellent as a paper trail so let's leave it :D
Hi, @tatiana, I'm helping the Cosmos team manage their backlog and am marking this issue as stale. From what I understand, the issue proposes adding support for caching virtual environments created with Could you please confirm if this issue is still relevant to the latest version of the Cosmos repository? If it is, please let the Cosmos team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days. Thank you!
As of Cosmos 1.2, when using ExecutionMode.VIRTUALENV, each task creates a new Python virtual environment in a temporary directory:
astronomer-cosmos/cosmos/operators/virtualenv.py, line 68 in a433f15
This can cause delays, as discussed in the Slack thread:
https://apache-airflow.slack.com/archives/C059CC42E9W/p1697614400289939
This could be improved, assuming the Airflow worker node is reused to run multiple tasks.
Proposal
Persist the virtual environment directory if users set the configuration ExecutionConfig.virtualenv_path.
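The proposal can be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not the Cosmos implementation: `ensure_virtualenv` is a hypothetical helper, its `virtualenv_path` argument stands in for the proposed ExecutionConfig.virtualenv_path setting, and the stdlib `venv` module is used here in place of whatever Cosmos uses internally.

```python
import tempfile
import venv
from pathlib import Path
from typing import Optional


def ensure_virtualenv(virtualenv_path: Optional[str] = None) -> Path:
    """Hypothetical helper: return a virtualenv directory, reusing it if set."""
    if virtualenv_path is None:
        # Behaviour described in the issue: a fresh environment in a
        # temporary directory for every task.
        target = Path(tempfile.mkdtemp(prefix="cosmos-venv-"))
    else:
        # Proposed behaviour: a user-chosen directory that persists
        # across tasks running on the same worker node.
        target = Path(virtualenv_path)

    # The stdlib venv module writes pyvenv.cfg when it creates an
    # environment, so its presence serves as an "already created" marker.
    if not (target / "pyvenv.cfg").exists():
        # with_pip=False keeps this sketch fast; a real setup would need
        # pip in order to install dbt and its adapters.
        venv.create(target, with_pip=False)
    return target
```

With a path set, the second and later tasks on the same worker skip environment creation entirely; without one, each call still pays the full setup cost. Note that this sketch does not guard against two tasks creating the same directory at the same time.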