Handle Virtual Environment Activation Within DVC #5758

Closed
rogermparent opened this issue Apr 1, 2021 · 14 comments
Labels
discussion (requires active participation to reach a conclusion), enhancement (Enhances DVC), product: VSCode (Integration with VSCode extension)

Comments

@rogermparent
Contributor

rogermparent commented Apr 1, 2021

There are many ways to go about solving this, some involving code changes for DVC and some simply establishing a convention with our current codebase. Each method has its own set of pros and cons, and we'd like to use the one that best fits how DVC is used.

Dealing with this issue properly can bring benefits to DVC CLI users, but will also greatly help with applications like the VSCode extension that consume DVC CLI as an API (for convenience, I'll call these consumers).

The problem

Whenever DVC calls an external script (like in exp run), running dvc without a PATH modified by an activated virtual environment is not sufficient.

Initially, we wanted to get this information from the VSCode Python extension in the same way it activates integrated VSCode terminals, but that extension currently does not export its internal logic for activating Python environments. Beyond that, there's the question of running DVC in environments that aren't handled by the Python extension, particularly those for other languages.
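
To illustrate the PATH dependence described above, here is a minimal sketch (not DVC code). It assumes a POSIX system, and the helper names `make_fake_executable` and `resolve_command` are hypothetical. It shows that the same command name resolves to a different executable depending on whether a virtual environment's bin directory comes first on PATH, which is what activation normally arranges.

```python
import os
import shutil
import stat
from typing import Optional


def make_fake_executable(directory: str, name: str) -> str:
    """Create a tiny executable file that stands in for a real interpreter."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
    # Mark the file as executable for its owner so PATH lookup accepts it.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def resolve_command(name: str, path_value: str) -> Optional[str]:
    """Return the executable that `name` resolves to under the given PATH."""
    return shutil.which(name, path=path_value)
```

With a "system" directory and a "venv" directory that both contain `python`, `resolve_command("python", system_dir)` finds the system copy, while putting the venv directory first finds the venv copy. A stage's `python train.py` goes through the same lookup.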

Possible solutions

I'd like to present this list of solutions in order to let others comment on them and pick which one is most preferred.

An overall theme I'd also like everyone to consider is:
"Is handling Virtual Environments something we'd like to handle within DVC itself?"

Activating Within dvc.yaml cmd

A dead-simple method is adding the activation command to the cmd field of dvc.yaml:

stages:
  train:
    cmd: >-
      source .env/bin/activate;
      python train.py;

It almost feels too simple, but configuring this way provides maximum freedom for projects and requires the least adaptation from consumers. However, this technically has a drawback in that users and consumers cannot freely use a project configured like this with a different or no virtual environment unless they change this command.

Adding a new activate configuration option

This is essentially a fancy version of the prior solution, but provides a feature that both users and consumers can get a lot of use out of.
Essentially, I'm thinking of a new activate config option that can be used like this:

At its most basic, a new top level option allows for specifying an activation command that will run before any normal cmd

activate: source .env/bin/activate
stages:
  train:
    cmd: python train.py

But individual stages can opt out...

  train_without_activation:
    activate: false
    cmd: python train.py

...or override the command.

  train_with_unique_activation:
    activate: conda activate specialenv
    cmd: python condatrain.py

and this could be paired with a CLI flag that can override the dvc.yaml settings for a specific run, similar to the params overrides.

A feature like this could go a long way for both users and consumers, especially where it's very user-configurable and stack-agnostic.
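
As a rough sketch of the semantics proposed above: this is not an existing DVC feature, and the function name `resolve_stage_cmd`, the `cli_override` parameter, and the precedence order (CLI flag, then stage, then top level) are assumptions made for illustration only.

```python
from typing import Optional, Union


def resolve_stage_cmd(
    stage: dict,
    top_level_activate: Optional[str] = None,
    cli_override: Optional[Union[str, bool]] = None,
) -> str:
    """Return the shell command a stage would run under the proposed option."""
    # Assumed precedence: CLI override > per-stage `activate` > top-level `activate`.
    if cli_override is not None:
        activate = cli_override
    elif "activate" in stage:
        activate = stage["activate"]
    else:
        activate = top_level_activate

    # `activate: true` is treated here as "use the top-level command".
    if activate is True:
        activate = top_level_activate

    # `activate: false`, or nothing configured, leaves the command unchanged.
    if not activate:
        return stage["cmd"]

    # `&&` stops the stage if activation itself fails.
    return f"{activate} && {stage['cmd']}"
```

Joining with `&&` rather than `;` is a design choice in this sketch: a failed activation stops the stage instead of silently running the command in the wrong environment.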

Templating

This is another solution that doesn't need modifications to dvc, but rather just establishes a convention that we'll want to document. Using Templating solves a few of the problems that the first dvc.yaml solution faces.

For example:

vars:
  - activate: source .env/bin/activate;
stages:
  train:
    cmd: ${activate} python train.py

Alternatively:

vars:
  - python: source .env/bin/activate; python
stages:
  train:
    cmd: ${python} train.py

I've had success running both of these with no activated environment needed beforehand.

Activating in Consumer Applications

I've probably shown my hand in that I'm slightly biased against this one, as re-implementing this on every consumer application for every DVC stack seems like an unnecessary duplication of work, despite technically being possible. I'm open to opposing arguments, however.

As mentioned in the introduction, the Python extension in VSCode is currently a little wonky to hook into as far as environment activation goes. We'll have to either re-create the extension's functionality or petition Microsoft to expose the API (through either an Issue or a PR; they have a GitHub repo), and putting in that effort won't address virtual environments for languages other than Python.

There's an argument that Python seems to be the only language used with DVC that has these venv issues, but I believe that this alone isn't enough to disqualify the idea that we should handle arbitrary activation commands for both simplicity of implementation and the ability to handle current and future corner cases.

@rogermparent rogermparent added discussion requires active participation to reach a conclusion enhancement Enhances DVC A: experiments Related to dvc exp question I have a question? labels Apr 1, 2021
@pmrowla
Contributor

pmrowla commented Apr 2, 2021

DVC is meant to be language agnostic though. I get why this is an issue for VSCode, but this seems like it can be framed as a more "how should the VSCode extension handle pipeline runtime environments in general"? This same issue applies if a user's stage needs to run a ruby or java command that is actually dependent on a specific rbenv or jenv environment.

Does VSCode not provide a way to set PATH when running external programs? It seems like having the user set PATH for their runtime environment correctly within VSCode would be the proper solution to me.

@pmrowla
Contributor

pmrowla commented Apr 2, 2021

It also seems like the VSCode extension will need a way for users to configure environment variables in general, not just PATH. Depending on what the user's code is doing, they may expect things like secret keys/authentication tokens, or just plain configuration data to be passed in via environment variables that are normally set outside of DVC (and are unrelated to venv-like execution environment).

@shcheklein
Member

First of all, thanks Roger. I really like that you made almost a complete RFC 🙏 ... with all the details and alternatives. It makes it extremely easy to read and understand.

DVC is meant to be language agnostic though.

I think Roger had a point about different languages here:

There's an argument that Python seems to be the only language used with DVC that has these venv issues, but I believe that this alone isn't enough to disqualify the idea that we should handle arbitrary activation commands for both simplicity of implementation and the ability to handle current and future corner cases.

I think this idea can be applied to Java for example (where you want to run a specific JDK).

Adding a new activate configuration option

One question. It feels that this configuration option should be "personal". People might have different environments they use, or at least different path to those on their machines. It means we can't easily save it into Git.

Activating in Consumer Applications

we can ask users to specify the same information in the project's settings if it comes to the point that there is no way to extract this info from the Python/other language extension. WDYT?

@pmrowla
Contributor

pmrowla commented Apr 2, 2021

One question. It feels that this configuration option should be "personal". People might have different environments they use, or at least different path to those on their machines. It means we can't easily save it into Git.

we can ask users to specify the same information in the project's settings if it comes to the point that there is no way to extract this info from the Python/other language extension. WDYT?

@shcheklein I think we are on the same page here, it seems to me that it would make sense for the user's PATH (and any other env vars) to be configured as a part of a VSCode project, rather than within DVC pipeline files like dvc.yaml (and subsequently committed into git)

@rogermparent
Contributor Author

rogermparent commented Apr 2, 2021

@pmrowla

DVC is meant to be language agnostic though. I get why this is an issue for VSCode, but this seems like it can be framed as a more "how should the VSCode extension handle pipeline runtime environments in general"? This same issue applies if a user's stage needs to run a ruby or java command that is actually dependent on a specific rbenv or jenv environment.

These features are absolutely language agnostic in that a generic activation command could be used for any virtual environment solution in any language; it's just that Python would be far and away the heaviest user of a feature like this. These features/techniques could be used for something like rbenv or nvm.

You have a point in that putting in all the work for the special activate feature would basically be all for Python. A convention like adding virtualenv activation in a variable would handle the python problem without necessitating special treatment.

Does VSCode not provide a way to set PATH when running external programs? It seems like having the user set PATH for their runtime environment correctly within VSCode would be the proper solution to me.

There's plenty of ways to set PATH, but the problem is getting the PATH that commands like source .env/bin/activate and conda activate set. The details and requirements of this also change with different virtualenv solutions, like conda and venv.
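
One generic way a consumer could obtain that environment is to run the activation command in a throwaway shell and dump what it leaves behind. The sketch below is only an illustration under assumptions: it is not how the VSCode Python extension or DVC does it, the function name `env_after_activation` is hypothetical, and it assumes `bash` and an `env` that supports `-0` are available.

```python
import os
import subprocess


def env_after_activation(activate_cmd: str, shell: str = "bash") -> dict:
    """Run `activate_cmd` in a fresh shell and return the resulting environment."""
    # `env -0` prints NUL-separated KEY=VALUE pairs, so values that contain
    # newlines are still parsed correctly.
    script = f"{activate_cmd} >/dev/null 2>&1 && env -0"
    result = subprocess.run(
        [shell, "-c", script], check=True, capture_output=True
    )
    env = {}
    for entry in result.stdout.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[os.fsdecode(key)] = os.fsdecode(value)
    return env
```

The returned dictionary could then be passed as `env=` when spawning `dvc`. This only works when the activation command runs in a non-interactive shell; tools such as conda may need extra shell initialization first, which is one way the requirements differ between virtualenv solutions.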

It also seems like the VSCode extension will need a way for users to configure environment variables in general, not just PATH. Depending on what the user's code is doing, they may expect things like secret keys/authentication tokens, or just plain configuration data to be passed in via environment variables that are normally set outside of DVC (and are unrelated to venv-like execution environment).

We could make a setting in the VSCode extension for passing specific env vars and/or activation commands to dvc runs, but the fact that dvc requires an activated venv to run experiments in projects that use them leads me to believe that a solution within dvc would be appropriate. The ability to dvc exp run on a project without worrying about virtual environments feels like good ux.

@shcheklein

One question. It feels that this configuration option should be "personal". People might have different environments they use, or at least different path to those on their machines. It means we can't easily save it into Git.

I think no matter what we do we'll need the ability to override this activation on a per-user basis, but most of the projects I've used as demos so far prescribe the use of a virtual environment in their README, going as far as to specify the exact creation and activation commands.
However, I have run into the edge case where only conda tensorflow worked for me, so I needed to use that on a project that specified venv. I'm sympathetic to this kind of corner case, so an override is important. In the context of the OP, this could be done with CLI overrides via a dedicated activate setting or with a dvc.yaml var override feature.

we can ask users to specify the same information in the project's settings if it comes to the point that there is no way to extract this info from the Python/other language extension. WDYT?

It's less that there's no way, and more that all the ways available to us to get anything beyond the path to a python interpreter are impractical and/or fragile. Should we decide not to establish this convention in DVC, the next step is probably just to add options in the VSCode Extension for DVC env and/or activation commands.

We can also take advantage of the ability to send prompts and popups to the user that can set the VSCode extension configuration to common settings in a more friendly way than setting them normally. This would also carry many of the benefits of the best solutions in the OP, including language-agnosticism.

As an aside, awesome discussion so far! This will help a ton with the VSCode extension.

@rogermparent
Contributor Author

@pmrowla It seems you had accidentally edited a new comment over my prior one- I'm going to copy your comment here so it's not destroyed by a revert.


There's plenty of ways to set PATH, but the problem is getting the PATH that commands like source .env/bin/activate and conda activate set. The details and requirements of this also change with different virtualenv solutions, like conda and venv.

I guess I'm a bit confused on this one, as the resulting PATH after running activate is just .env/bin:$PATH, and with conda it is usually just ~/anaconda/bin/:$PATH. In both the venv and conda use cases, the user should know where they installed the venv or conda, so asking them to set PATH appropriately in VSCode seems reasonable to me.
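
For the plain venv case, the PATH-prepending view above can be sketched as follows. This is a hedged illustration rather than anything DVC or the extension ships: the helper name `venv_env` is hypothetical, and it mirrors only what a standard venv `activate` script does to the environment (prepend the bin directory to PATH, set `VIRTUAL_ENV`, unset `PYTHONHOME`). Other tools may set additional variables.

```python
import os
from typing import Mapping, Optional


def venv_env(venv_dir: str, base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Build an environment dict equivalent to an activated venv."""
    env = dict(os.environ if base_env is None else base_env)
    venv = os.path.abspath(venv_dir)
    # venv uses `Scripts` on Windows and `bin` elsewhere.
    bin_dir = os.path.join(venv, "Scripts" if os.name == "nt" else "bin")
    env["VIRTUAL_ENV"] = venv
    env["PATH"] = bin_dir + os.pathsep + env.get("PATH", "")
    # The activate script unsets PYTHONHOME so it cannot redirect the interpreter.
    env.pop("PYTHONHOME", None)
    return env


# Possible usage from a consumer (not executed here):
#   subprocess.run(["dvc", "exp", "run"], env=venv_env(".env"))
```

This keeps the consumer free of any shell invocation, at the cost of having to know each tool's layout in advance.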

@pmrowla
Contributor

pmrowla commented Apr 2, 2021

My mistake, meant to use quote reply

@skshetry skshetry removed A: experiments Related to dvc exp question I have a question? labels Apr 2, 2021
@daavoo
Contributor

daavoo commented Apr 5, 2021

Hi there!

Maybe this is partially? out of the scope of the issue as the discussion seems to be focused on the VSCode Extension and I'd be describing our interests/opinions not related with that extension but more from a DVC CLI point of view:

"Is handling Virtual Environments something we'd like to handle within DVC itself?"

We are quite interested in knowing which direction dvc will take regarding this matter and how we could help to get it there.

Our desired use case would be to have an environment description file (i.e. conda.yaml or Dockerfile) linked to a dvc.yaml in a way that dvc repro would (automagically?) handle the environment activation (and tracking i.e. always rerun if env file has changed) for running all the stages in the pipeline.

We have been exploring similar solutions to the ones exposed in the issue description but all felt a little "hacky" and required modifications to the existing dvc.yaml files; even more accentuated when trying to also cover the need for tracking the environment description file.

We are currently relying on wrapping our dvc pipelines inside MLflow Projects as a solution for those needs. This actually covers our needs but the integration between dvc and mlflow also feels a little "hacky".

Another alternative we have been exploring are Pachyderm pipelines which let you specify individual environments for each stage in the pipeline; however this felt more like a replacement and not compatible with our existing dvc pipelines (and overkill for some of the use cases).

@rogermparent
Contributor Author

rogermparent commented Apr 5, 2021

Thanks for the feedback @daavoo!

Maybe this is partially? out of the scope of the issue as the discussion seems to be focused on the VSCode Extension and I'd be describing our interests/opinions not related with that extension but more from a DVC CLI point of view:

Actually, while this Issue stems from the VSCode Extension, it was made here since the Extension is one of the first in a list of DVC integrations that wrap the CLI which I'm sure will get larger over time. I opened the Issue here as a consideration that we may want to handle this in DVC which will also take care of the issue downstream in any application that calls dvc.

Our desired use case would be to have an environment description file (i.e. conda.yaml or Dockerfile) linked to a dvc.yaml in a way that dvc repro would (automagically?) handle the environment activation (and tracking i.e. always rerun if env file has changed) for running all the stages in the pipeline.

We have been exploring similar solutions to the ones exposed in the issue description but all felt a little "hacky" and required modifications to the existing dvc.yaml files; even more accentuated when trying to also cover the need for tracking the environment description file.

Good to know! I had figured the kind of solution described in the OP would be "opt-in", such that existing users aren't forced to switch until they want to plug a DVC project that depends on virtual environments into an integration that expects virtual environments to be handled.

We are currently relying on wrapping our dvc pipelines inside MLflow Projects as a solution for those needs. This actually covers our needs but the integration between dvc and mlflow also feels a little "hacky".

Another alternative we have been exploring are Pachyderm pipelines which let you specify individual environments for each stage in the pipeline; however this felt more like a replacement and not compatible with our existing dvc pipelines (and overkill for some of the use cases).

Interesting! I'd like to throw out a few suggestions for something like this that I've discovered while working with the VSCode Python extension, as there are a few programs not mentioned here that the VSC Python extension integrates with. This info comes primarily from the "Where the Extension looks for Environments" section of the "Using Python Environments in Visual Studio Code" article.

If nothing else, the fact that an official Microsoft extension goes out of its way to support these means they've gained enough traction to be worth looking into.

By name, there's virtualenvwrapper, pyenv, direnv, and pipenv. There's also an "Environment Variable Definitions File" which is a little less friendly but adds no dependencies.

I'm unsure how much users intend to integrate DVC with a tool like this, so the difficulty of doing so could certainly be even more than I'm thinking. Since this is a problem worth solving in a way that's well thought out, I have no issues adding a basic method of handling this to the DVC VSCode Extension and then integrating whatever method or convention we decide on for DVC CLI when it's added and usable.

As an aside, the DVC team should feel free to ask for input from the VSCode extension team for our thoughts on this issue or any others related to consuming DVC CLI as an API!

@dberenbaum
Collaborator

My short take:

  1. "Specify an environment in which to run DVC" should be high priority (for collaboration like @daavoo mentions, for CML, and for remote execution support).
  2. Defining the environment in the DVC project should be the least preferred option (one of the major differences from tools like Pachyderm is that DVC code is separate from the environment).

@rogermparent

These are the problems I'm seeing described:

  1. There is currently no way for the user to specify the information about the environment they want to use (like the name of the environment or the desired value for PATH).
  2. Virtual environments are normally activated through CLI, which generally sets the PATH or other environment variables, and the appropriate values will differ for each virtual environment tool in Python (virtualenv, conda) or other languages. The extension should not have to have individual support for each virtual environment tool.

Do you agree with those?

@daavoo

Our desired use case would be to have an environment description file (i.e. conda.yaml or Dockerfile) linked to a dvc.yaml in a way that dvc repro would (automagically?) handle the environment activation (and tracking i.e. always rerun if env file has changed) for running all the stages in the pipeline.

This seems to challenge a core philosophy underneath DVC: that code and data are sufficient for reproducibility. The "environment tracking" part of your comment is really interesting, but could quickly derail this issue 😄 . Do you want to start a discussion, or we can figure out another way to communicate about it?

@rogermparent
Contributor Author

rogermparent commented Apr 5, 2021

  1. "Specify an environment in which to run DVC" should be high priority (for collaboration like @daavoo mentions, for CML, and for remote execution support).

  2. Defining the environment in the DVC project should be the least preferred option (one of the major differences from tools like Pachyderm is that DVC code is separate from the environment).

Just to make sure we're on the same page: are you talking about "1" being a feature of the VSCode extension, or DVC CLI? The context of "2" seems to imply the former is the case.

These are the problems I'm seeing described:

  1. There is currently no way for the user to specify the information about the environment they want to use (like the name of the environment or the desired value for PATH).

  2. Virtual environments are normally activated through CLI, which generally sets the PATH or other environment variables, and the appropriate values will differ for each virtual environment tool in Python (virtualenv, conda) or other languages. The extension should not have to have individual support for each virtual environment tool.

Do you agree with those?

I agree for the most part, but if you're talking about the VSCode extension on "1" here, currently we follow the Python Extension's selected interpreter/environment- probably a bit too much so. The way dvc is invoked can be overridden for cases like a user having the dvc executable global and the project in a venv, but that is very much not the standard path with the extension as it is now despite it being very common as I've found out recently.

@dberenbaum
Collaborator

Just to make sure we're on the same page: are you talking about "1" being a feature of the VSCode extension, or DVC CLI? The context of "2" seems to imply the former is the case.

I actually meant that "specify an environment in which to run DVC" should be high priority for DVC. I know that sounds contradictory to my other point 🤣 , but that doesn't necessarily imply that the environment needs to be defined within DVC (or at least not inside dvc.yaml). For example, it could be an addition to the docs to explain how to run DVC in a specified environment.

@rogermparent
Contributor Author

For example, it could be an addition to the docs to explain how to run DVC in a specified environment.

I see! Glad I asked! The current convention specified in the docs is that it's strongly recommended to add dvc within the virtual environment, which is what gave me my initial misconception.

Should we specify that global dvc with project venv is another common use case with its own advantages? Off the top of my head, not redownloading/duplicating the dvc install is the primary justification for this.

@dberenbaum dberenbaum added the product: VSCode Integration with VSCode extension label Apr 5, 2021
@dberenbaum
Collaborator

Should we specify that global dvc with project venv is another common use case with its own advantages? Off the top of my head, not redownloading/duplicating the dvc install is the primary justification for this.

Feel free to open an issue in https://github.com/iterative/dvc.org/issues if you have docs suggestions. A global install makes sense if I'm not too concerned about version changes, but if I have multiple projects and want to tie each to a frozen set of package versions, it would not be ideal.


6 participants