Handle Virtual Environment Activation Within DVC #5758
Comments
DVC is meant to be language agnostic though. I get why this is an issue for VSCode, but it seems like this could be framed more generally as "how should the VSCode extension handle pipeline runtime environments?" The same issue applies if a user's stage needs to run a ruby or java command that depends on a specific rbenv or jenv environment. Does VSCode not provide a way to set PATH when running external programs? Having the user set PATH correctly for their runtime environment within VSCode seems like the proper solution to me.
It also seems like the VSCode extension will need a way for users to configure environment variables in general, not just PATH. Depending on what the user's code is doing, they may expect things like secret keys/authentication tokens, or just plain configuration data to be passed in via environment variables that are normally set outside of DVC (and are unrelated to venv-like execution environment). |
First of all, thanks Roger. I really like that you made almost a complete RFC 🙏 ... with all the details and alternatives. It makes it extremely easy to read and understand.
I think Roger had a point about different languages here:
I think this idea can be applied to Java for example (where you want to run a specific JDK).
One question. It feels like this configuration option should be "personal". People might use different environments, or at least have different paths to them on their machines. That means we can't easily save it into Git.
We can ask users to specify the same information in the project's settings if it comes to the point that there is no way to extract this info from the Python/other language extension. WDYT?
@shcheklein I think we are on the same page here, it seems to me that it would make sense for the user's PATH (and any other env vars) to be configured as a part of a VSCode project, rather than within DVC pipeline files like |
These features are absolutely language agnostic in that a generic activation command could be used for any virtual environment solution in any language; it's just that Python would be far and away the heaviest user of a feature like this. These features/techniques could be used for something like rbenv or nvm. You have a point in that putting in all the work for the special activate feature would basically be all for Python. A convention like adding virtualenv activation in a variable would handle the Python problem without necessitating special treatment.
There's plenty of ways to set PATH, but the problem is getting the PATH that commands like source .env/bin/activate and conda activate set. The details and requirements of this also change with different virtualenv solutions, like conda and venv.
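As a rough illustration of what "getting the PATH that an activation command sets" could look like from a consumer's side, here is a minimal sketch. It assumes a POSIX sh is available and that the activation command is trusted; the helper name env_after_activation is hypothetical and is not part of DVC or the VSCode extension.

```python
import json
import shlex
import subprocess
import sys


def env_after_activation(activation_cmd: str) -> dict:
    """Return the exported environment left behind by an activation command.

    Hypothetical helper: runs the command in a POSIX shell, then has a child
    Python process print the environment it inherited as JSON.
    """
    dump_env = "import json, os; print(json.dumps(dict(os.environ)))"
    # The brace group keeps the activation in the current shell while
    # silencing its output, so only the JSON dump reaches stdout.
    script = (
        f"{{ {activation_cmd}\n}} >/dev/null 2>&1 && "
        f"{shlex.quote(sys.executable)} -c {shlex.quote(dump_env)}"
    )
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)
```

Calling it with something like ". .env/bin/activate" would return the variables a later dvc call could be given. Only exported variables are captured, and the fragile part mentioned above remains: each virtualenv tool needs its own activation command.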
We could make a setting in the VSCode extension for passing specific env vars and/or activation commands to dvc runs, but the fact that dvc requires an activated venv to run experiments in projects that use them leads me to believe that a solution within dvc would be appropriate. The ability to dvc exp run on a project without worrying about virtual environments feels like good ux.
I think no matter what we do we'll need the ability to override this activation on a per-user basis, but most of the projects I've used as demos so far prescribe the use of a virtual environment in their README, going as far as to specify the exact creation and activation commands.
It's less that there's no way, but rather that all the ways available to us to get anything beyond the path to a python interpreter are impractical and/or fragile. Should we decide not to establish this convention in DVC, the next step is probably just to add options in the VSCode Extension for DVC env and/or activation commands. We can also take advantage of the ability to send prompts and popups to the user that can set the VSCode extension configuration to common settings in a more friendly way than setting them normally. This would also carry many of the benefits of the best solutions in the OP, including language-agnosticism. As an aside, awesome discussion so far! This will help a ton with the VSCode extension.
@pmrowla It seems you had accidentally edited a new comment over my prior one- I'm going to copy your comment here so it's not destroyed by a revert.
I guess I'm a bit confused on this one, as the resulting PATH after running |
My mistake, meant to use quote reply |
Hi there! Maybe this is partially out of the scope of the issue, as the discussion seems to be focused on the VSCode Extension and I'd be describing our interests/opinions not related to that extension but more from a
We are quite interested in knowing which direction will take Our desired use case would be to have an environment description file (i.e. We have been exploring similar solutions to the ones exposed in the issue description but all felt a little "hacky" and required modifications to the existing We are currently relying on wrapping our dvc pipelines inside MLflow Projects as a solution for that needs. This actually covers our needs but the integration between Another alternative we have been exploring are Pachyderm pipelines which let you specify individual environments for each stage in the pipeline; however this felt more like a replacement and not compatible with our existing |
Thanks for the feedback @daavoo!
Actually, while this Issue stems from the VSCode Extension, it was made here since the Extension is one of the first in a list of DVC integrations that wrap the CLI which I'm sure will get larger over time. I opened the Issue here as a consideration that we may want to handle this in DVC which will also take care of the issue downstream in any application that calls
Good to know! I had figured the kind of solution described in the OP would be "opt-in", such that existing users aren't forced to switch until they want to plug a DVC project that depends on virtual environments into an integration that expects virtual environments to be handled.
Interesting! I'd like to throw out a few suggestions for something like this that I've discovered while working with the VSCode Python extension, as there are a few programs not mentioned here that the VSC Python extension integrates with- This info comes primarily from the "Where the Extension looks for Environments" section of the "Using Python Environments in Visual Studio Code" article. If nothing else, the fact that an official Microsoft extension goes out of its way to support these means they've gained enough traction to be worth looking into. By name, there's I'm unsure how much users intend to integrate DVC with a tool like this, so the difficulty of doing so could certainly be even more than I'm thinking. Since this is a problem worth solving in a way that's well thought out, I have no issues adding a basic method of handling this to the DVC VSCode Extension and then integrating whatever method or convention we decide on for DVC CLI when it's added and usable. As an aside, the DVC team should feel free to ask for input from the VSCode extension team for our thoughts on this issue or any others related to consuming DVC CLI as an API! |
My short take:
These are the problems I'm seeing described:
Do you agree with those?
This seems to challenge a core philosophy underneath DVC: that code and data are sufficient for reproducibility. The "environment tracking" part of your comment is really interesting, but could quickly derail this issue 😄 . Do you want to start a discussion, or we can figure out another way to communicate about it? |
Just to make sure we're on the same page: are you talking about "1" being a feature of the VSCode extension, or DVC CLI? The context of "2" seems to imply the former is the case.
I agree for the most part, but if you're talking about the VSCode extension on "1" here, currently we follow the Python Extension's selected interpreter/environment- probably a bit too much so. The way dvc is invoked can be overridden for cases like a user having the |
I actually meant that "specify an environment in which to run DVC" should be high priority for DVC. I know that sounds contradictory to my other point 🤣 , but that doesn't necessarily imply that the environment needs to be defined within DVC (or at least not inside |
I see! Glad I asked! The current convention specified in the docs is that it's strongly recommended to add Should we specify that global |
Feel free to open an issue in https://github.com/iterative/dvc.org/issues if you have docs suggestions. A global install makes sense if I'm not too concerned about version changes, but if I have multiple projects and want to tie each to a frozen set of package versions, it would not be ideal. |
There's many ways to go about solving this, some involving code changes for DVC and some simply establishing a convention with our current codebase, but each method has its own set of pros and cons and we'd like to use the one that best fits with how DVC is used.
Dealing with this issue properly can bring benefits to DVC CLI users, but will also greatly help with applications like the VSCode extension that consume DVC CLI as an API (for convenience, I'll call these consumers).
The problem
Whenever DVC calls an external script (as in exp run), running dvc without a PATH modified by an activated virtual environment is not sufficient: the stage command then resolves its interpreter and packages from outside the project's environment.
Initially, we wanted to get this information from the VSCode Python extension in the same way it activates integrated VSCode terminals, but currently that extension does not export its internal logic to activate Python environments- beyond that, there's the question of running DVC in environments that aren't handled by the Python extension, particularly other languages.
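To make the PATH dependence concrete, here is a small sketch using only the Python standard library on a POSIX system. The helper names and the fake tool are hypothetical and exist only for illustration; the point is that the same command name resolves differently depending on whether the environment's bin directory is at the front of PATH.

```python
import os
import shutil
import stat
import tempfile


def make_fake_venv_tool(name: str) -> str:
    """Create a throwaway bin directory holding an executable called `name`."""
    bin_dir = tempfile.mkdtemp(prefix="fake-venv-bin-")
    tool = os.path.join(bin_dir, name)
    with open(tool, "w") as handle:
        handle.write("#!/bin/sh\necho hello from the venv\n")
    # Mark the file executable for its owner so PATH lookup accepts it.
    os.chmod(tool, os.stat(tool).st_mode | stat.S_IXUSR)
    return bin_dir


def resolve_with_venv(command: str, venv_bin: str, base_path: str) -> tuple:
    """Look up `command` without and with `venv_bin` prepended to PATH."""
    without_venv = shutil.which(command, path=base_path)
    with_venv = shutil.which(
        command, path=os.pathsep.join([venv_bin, base_path])
    )
    return without_venv, with_venv
```

If the tool only exists inside the environment, the first lookup returns None and the second returns the path inside the environment's bin directory, which is the situation a stage command is in when dvc is launched without activation.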
Possible solutions
I'd like to present this list of solutions in order to let others comment on them and pick which one is most preferred.
An overall theme I'd also like everyone to consider is:
"Is handling Virtual Environments something we'd like to handle within DVC itself?"
Activating Within dvc.yaml cmd
A dead-simple method is adding the activation command to the cmd field of dvc.yaml:
It almost feels too simple, but configuring this way provides maximum freedom for projects and requires the least adaptation from consumers. However, this technically has a drawback in that users and consumers cannot freely use a project configured like this with a different or no virtual environment unless they change this command.
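The effect of this approach can be sketched in a few lines, assuming the stage command is handed to a shell as a single string. The helper names are hypothetical, and the activation command in the usage note is just the one quoted elsewhere in this thread.

```python
import subprocess


def cmd_with_activation(activation_cmd: str, stage_cmd: str) -> str:
    """Build the single string a stage's cmd field would hold in this approach."""
    return f"{activation_cmd} && {stage_cmd}"


def run_stage_cmd(cmd: str) -> str:
    """Run the combined command through the shell and return its stdout."""
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, cmd_with_activation(". .env/bin/activate", "python train.py") yields the string that would go in the cmd field. The drawback noted above is visible here too: the activation is baked into the string, so using a different environment means editing it.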
Adding a new activate configuration option
This is essentially a fancy version of the prior solution, but provides a feature that both users and consumers can get a lot of use out of.
Essentially, I'm thinking of a new activate config option that can be used like this:
At its most basic, a new top level option allows for specifying an activation command that will run before any normal cmd
But individual stages can opt out...
...or override the command.
and this could be paired with a CLI flag that can override the dvc.yaml settings for a specific run, similar to the params overrides.
A feature like this could go a long way for both users and consumers, especially where it's very user-configurable and stack-agnostic.
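The precedence described above (a top-level command, a per-stage opt-out or override, and a per-run CLI override) could be resolved roughly as follows. This is only a sketch of the proposal: the activate key does not exist in DVC, and encoding the opt-out as a false or empty stage-level value is an assumption made here for illustration.

```python
from typing import Optional


def resolve_activation(
    config: dict, stage_name: str, cli_override: Optional[str] = None
) -> Optional[str]:
    """Pick the activation command for one stage under the proposed rules."""
    # A per-run CLI override wins over anything written in the file.
    if cli_override is not None:
        return cli_override
    stage = config.get("stages", {}).get(stage_name, {})
    if "activate" in stage:
        # Assumed encoding: a falsy value opts out, a string overrides.
        return stage["activate"] or None
    # Otherwise fall back to the hypothetical top-level option.
    return config.get("activate")


def effective_cmd(
    config: dict, stage_name: str, cli_override: Optional[str] = None
) -> str:
    """Return the command that would actually run for the stage."""
    stage_cmd = config["stages"][stage_name]["cmd"]
    activation = resolve_activation(config, stage_name, cli_override)
    return f"{activation} && {stage_cmd}" if activation else stage_cmd
```

Keeping the resolution in one place like this is what would let consumers ignore virtual environments entirely, while a user with a different local setup overrides the activation for a single run.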
Templating
This is another solution that doesn't need modifications to dvc, but rather just establishes a convention that we'll want to document. Using Templating solves a few of the problems that the first dvc.yaml solution faces.
For example:
Alternatively:
I've had success running both of these with no activated environment needed beforehand.
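The idea behind the templating convention can be imitated with Python's standard string.Template, which happens to share the ${name} placeholder syntax. This is only a stand-in for illustration, not DVC's interpolation engine, and the variable name activate and helper expand_cmd are hypothetical.

```python
from string import Template


def expand_cmd(cmd_template: str, variables: dict) -> str:
    """Substitute ${name} placeholders in a stage command from a mapping."""
    # substitute() raises KeyError when a placeholder has no value, so a
    # missing activation variable fails loudly instead of running unactivated.
    return Template(cmd_template).substitute(variables)
```

With variables of {"activate": ". .env/bin/activate"}, the template "${activate} && python train.py" expands to the same combined command as the first solution, but the activation now lives in one variable that each user can change without touching the stage definitions.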
Activating in Consumer Applications
I've probably shown my hand in that I'm slightly biased against this one, as re-implementing this on every consumer application for every DVC stack seems like an unnecessary duplication of work, despite technically being possible. I'm open to opposing arguments, however.
As mentioned in the introduction, the Python extension in VSCode is currently a little wonky to hook into as far as environment activation- we'll have to either re-create the extension's functionality or petition Microsoft to expose the API (either through Issue or a PR, they have a GitHub repo), and putting in the effort of doing so won't address languages in virtual environments other than Python.
There's an argument that Python seems to be the only language used with DVC that has these venv issues, but I believe that this alone isn't enough to disqualify the idea that we should handle arbitrary activation commands for both simplicity of implementation and the ability to handle current and future corner cases.