external workspaces #3920
Comments
does it mean that we'll have to duplicate files on s3 (or other cloud) on […]
this notation does not solve one of the main problems - workspace isolation, so we'll have to use […]. I would say that ideally what we want is something like: […]
now I should be able to do: […]
and if […] Ideally, the same should be happening with code - if we pass […]
Not necessarily. As you know, currently push/pull doesn't affect external outs at all. I think we will be able to configure an […]
That seems very implicit, but there is definitely potential there. Though I'm not sure how we could isolate the code in such a way.
Hi @efiop, our setup still works great, as we all have individual "workspaces" on a single remote machine and at the same time share the cache on that same remote machine. The setup is still the same as described here: https://github.com/PeterFogh/dvc_dask_use_case, where the global DVC configuration provides the path to the individual workspaces.
I think your proposal of getting DVC to better support workspaces on remote machines and data storage is important. The biggest advantage of DVC is using it to manage large pipelines with lots of data (i.e. > 5GB), for example, processing data from multiple sources and combining it into model features and model training/evaluation. I do not want to process or store this amount of data on my local machine; thus, in my opinion, DVC should mainly focus on the management of data and process execution on remote machines. DVC should also support isolating such data and process execution to individual machines at different locations, meaning it should be possible to have: […]
@efiop, I like your proposal of the DVC "workspace", as I assume it will simplify my setup (meaning https://github.com/PeterFogh/dvc_dask_use_case). However, as I understand it, it will not make DVC more "flexible" or add more functionality than my current setup. Is that correctly understood? I think it is important that the DVC "workspace" incorporates a way to isolate the DVC cache, DVC workspaces, and process execution to individual machines at different locations. @efiop, is it possible that it could support that?
Correct, it should prevent the misuse for the most part and enforce isolation in some sense.
So you mean abstracting not only the paths, as we do now, but also the executors? Like being able to launch your stage locally or, say, on some EC2 machine or just another server? If so, we are also now actively looking into that (though only for the local scenario) in #2799.
For the record: we also have a problem of currently using unsuitable checksums as hashes, which makes it impossible to […]. E.g. a gs --external user: https://discord.com/channels/485586884165107732/485596304961962003/817175897488621588
For the record: we are getting forced into formalizing this at least internally for data management. No product decisions though.
We currently support a so-called "external outputs" scenario that is based on creating separate external cache locations for each type of output (e.g. s3, ssh, etc). This scenario is unpolished and even outright broken in some cases, and is also constantly being misused with the intention of importing files from remote locations to the local cache/workspace or to a remote. People often don't realise that this effectively extends your workspace outside of your local repo and needs proper isolation, in order to not run into a conflict with a colleague running
dvc checkout
while you are working on the file/dir in the external workspace. This makes me think that we need to introduce proper terminology and abstraction to clearly state what this is all about and make this powerful feature usable. The solution is to introduce the concept of "workspaces". It could possibly look something like this:
*** DRAFT ***
Now, unless explicitly configured otherwise, dvc will assume that you want to use
ssh://example.com/home/efiop/.dvc/cache
as a default cache for artifacts in that workspace, similar to your local repo. Or, with a special workspace notation (similar to the so-called remote notation that we currently have: remote://myremote/path): […]
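As a purely hypothetical sketch (not the actual proposed syntax), a workspace config section and the workspace notation could look something like the following, by analogy with the existing remote sections and the remote://myremote/path notation. The workspace name, host, path, and option names here are illustrative assumptions only:

```
# .dvc/config - hypothetical sketch only; "workspace" sections are part of this
# proposal, not an existing DVC feature. Names and paths are made up.
['workspace "myserver"']
    url = ssh://example.com/home/efiop

# An output could then be referenced with a workspace notation, by analogy with
# the existing remote notation (remote://myremote/path), e.g.:
#
#   workspace://myserver/data/file.csv
#
# and, unless configured otherwise, the default cache for artifacts in this
# workspace would be assumed at ssh://example.com/home/efiop/.dvc/cache
```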
This notation is nice because it allows you to redefine your workspaces (e.g. for each coworker to use his own home directory on the server) pretty easily (plus the config options can be set once in the config section for that workspace).
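As an illustration of the per-coworker case, each user could override the workspace URL at one of DVC's per-user config levels (e.g. .dvc/config.local, which is not committed). Again, the workspace section itself is only a hypothetical sketch, and the username and path are made up:

```
# .dvc/config.local (per-user, not committed) - hypothetical override so that
# a coworker uses their own home directory on the same server
['workspace "myserver"']
    url = ssh://example.com/home/peter
```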
Current cache.local/ssh/s3/gs/etc sections will get removed, because it is wrong that we operate based on the cache scheme and not on the workspace we are working with.
CC @PeterFogh, would appreciate your thoughts on this 🙂
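For reference, the per-scheme cache sections mentioned above are what today's external-outputs setup relies on; it looks roughly like the following (the remote name and path are made up):

```
# .dvc/config - current per-scheme external cache setup that the workspace
# concept would replace: a separate cache location per scheme (here: ssh)
['remote "sshcache"']
    url = ssh://example.com/home/shared/dvc-cache
[cache]
    ssh = sshcache
```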