
external workspaces #3920

Open · efiop opened this issue May 31, 2020 · 6 comments
Labels: discussion (requires active participation to reach a conclusion) · feature request (Requesting a new feature)

Comments

@efiop (Contributor) commented May 31, 2020

We currently support a so-called "external outputs" scenario, which is based on creating a separate external cache location for each type of output (e.g. s3, ssh, etc). This scenario is unpolished, outright broken in some cases, and is also constantly misused with the intention of importing files from remote locations into the local cache/workspace or a remote. People often don't realize that this effectively extends your workspace outside of your local repo and needs proper isolation, so that you don't run into a conflict with your colleague running dvc checkout while you are working on a file/dir in the external workspace.

This makes me think that we need to introduce proper terminology and an abstraction that clearly states what this is all about and makes this powerful feature usable. The solution is to introduce the concept of "workspaces". It could look something like this:

*** DRAFT ***

  1. Define the external workspace you want to attach:
dvc workspace add myssh ssh://example.com/home/efiop

Now, unless explicitly configured otherwise, dvc will assume that you want to use ssh://example.com/home/efiop/.dvc/cache as the default cache for artifacts in that workspace, similar to your local repo.
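For example, the implied default could be overridden per workspace. A sketch, assuming a hypothetical modify subcommand that mirrors the existing dvc remote modify pattern:

# implied default, assumed automatically:
#   cache = ssh://example.com/home/efiop/.dvc/cache
# hypothetical override for a non-standard cache location:
dvc workspace modify myssh cache ssh://example.com/scratch/dvc-cache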

  2. Use your workspace:
dvc add ssh://example.com/home/efiop/data
dvc run -o ssh://example.com/home/efiop/model.pkl ...

or with a special workspace notation (similar to the so-called remote notation that we currently have: remote://myremote/path):

dvc add ws://myssh/data
dvc run -o ws://myssh/model.pkl

This notation is nice because it allows you to redefine your workspaces (e.g. for each coworker to use their own home directory on the server) pretty easily (plus the config options can be set once in the config section for that workspace).
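For example (hypothetical syntax, reusing the existing --local config convention so that per-user overrides stay out of the shared repo config):

# Alice points the same workspace name at her own home directory:
dvc workspace modify myssh url ssh://example.com/home/alice --local
# the same ws:// paths and .dvc files now resolve to her directory:
dvc add ws://myssh/data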

Current cache.local/ssh/s3/gs/etc sections will be removed, because it is wrong that we operate based on the cache scheme rather than on the workspace we are working in.

CC @PeterFogh , would appreciate your thoughts on this 🙂

@shcheklein (Member):
> Current cache.local/ssh/s3/gs/etc sections will be removed, because it is wrong that we operate based on the cache scheme rather than on the workspace we are working in.

Does it mean that we'll have to duplicate files on s3 (or another cloud) on dvc push? Will it copy data from s3://something/.dvc/cache to s3://remote-storage?

> dvc run -o ssh://example.com/home/efiop/model.pkl

This notation does not solve one of the main problems - workspace isolation - so we'll have to use the ws:// one, similar to remote://, which kind of defeats the point, unless I'm missing something?


I would say that ideally what we want is something like:

dvc workspace add projectA-on-ssh ssh://example.com/home/efiop/projectA

now I should be able to do:

dvc run -d file

and if file does not exist locally, it should try to use the remote one.

Ideally, the same should happen with code: if we pass file and the code tries to read it, we try to resolve it properly.
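A sketch of that resolution flow (behavior entirely hypothetical):

dvc workspace add projectA-on-ssh ssh://example.com/home/efiop/projectA
dvc run -d file ...
# 'file' is absent locally, so dvc would fall back to
# ssh://example.com/home/efiop/projectA/file for the dependency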

@efiop (Contributor, Author) commented Jun 1, 2020

> Does it mean that we'll have to duplicate files on s3 (or another cloud) on dvc push? Will it copy data from s3://something/.dvc/cache to s3://remote-storage?

Not necessarily. As you know, push/pull currently doesn't affect external outs at all. I think we will be able to configure an external shared cache there too, one that your local dvc push would push to, if you need it set up that way.
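Something along these lines, for example (the workspace subcommand and its behavior are hypothetical):

# hypothetical: give the workspace a shared external cache that a local
# `dvc push` would also fill, instead of copying the whole workspace cache:
dvc workspace modify mys3 cache s3://shared-bucket/dvc-cache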

> I would say that ideally what we want is something like:

That seems very implicit, but there is definitely potential there. Though I'm not sure how we could isolate the code in such a way.

@PeterFogh (Contributor):
Hi @efiop,
It has been a long time since I have contributed to the DVC community, mainly because the data science team I'm part of settled on a stable setup for performing ML experiments using DVC version 0.84, after you released it as a Conda package. Hopefully we'll get time over the summer to update to your new version 1 and see which new features we can benefit from :)

Our setup still works great, as we all have individual "workspaces" on a single remote machine while sharing the cache on that same remote machine. The setup is still the same as described here: https://github.com/PeterFogh/dvc_dask_use_case, where the global DVC configuration provides the path to the individual workspaces.
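For reference, the shared-cache half of that setup is already expressible with today's config options, assuming the shared cache directory is accessible from every user's account on the machine:

# real, existing DVC config: point every user's repo at one shared cache dir
dvc config cache.dir /mnt/shared/dvc-cache --global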

I think your proposal of getting DVC to better support workspaces on remote machines and remote data storage is important. The biggest advantage of DVC is using it to manage large pipelines with lots of data (i.e. > 5GB), for example, processing data from multiple sources and combining it into model features and model training/evaluation. I do not want to process or store this amount of data on my local machine; thus, in my opinion, DVC should mainly focus on the management of data and process execution on remote machines.

DVC should also support isolating such data and process execution to individual machines in different locations, meaning it should be possible to have:

  • the DVC cache located on one storage machine (let us call it C),
  • the individual workspaces located on separate storage machines (let us call them W1 to Wn, where "n" is the number of team members), and
  • finally, individual machines (let us call them P1 to Pn) for process execution for each workspace. The process execution machines should, of course, have access to the dedicated workspace storage machine.

The setup I have "tried" to describe here will probably be the next step for my team's DVC setup, as sketched below.
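In the proposal's terms, that separation might look something like this (syntax entirely hypothetical; the execution machines P1 to Pn are outside the scope of path abstraction):

# workspace for team member 1, on storage machine W1:
dvc workspace add w1 ssh://w1.example.com/home/member1/project
# shared cache on storage machine C:
dvc workspace modify w1 cache ssh://c.example.com/dvc-cache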

@efiop, I like your proposal of the DVC "workspace", as I assume it will simplify my setup (meaning https://github.com/PeterFogh/dvc_dask_use_case). However, as I understand it, it will not make DVC more "flexible" or add more functionality than my current setup has. Is that correctly understood?

I think it is important that the DVC "workspace" incorporates a way to isolate the DVC cache, the DVC workspaces, and process execution to individual machines in different locations. @efiop, is it possible that it could support that?

@efiop (Contributor, Author) commented Jul 4, 2020

> However, as I understand it, it will not make DVC more "flexible" or add more functionality than my current setup has. Is that correctly understood?

Correct. It should mostly prevent the misuse and enforce isolation in some sense.

> I think it is important that the DVC "workspace" incorporates a way to isolate the DVC cache, the DVC workspaces, and process execution to individual machines in different locations. @efiop, is it possible that it could support that?

So you mean abstracting not only the paths, as we do now, but also the executors? Like being able to launch your stage locally or, say, on some ec2 machine or just another server? If so, we are also now actively looking into that (though only for the local scenario) in #2799.

@efiop (Contributor, Author) commented Mar 4, 2021

For the record: we also have the problem that we currently use unsuitable checksums as hashes, which makes it impossible to push/pull/etc external data. So we might consider sacrificing more time when adding data in order to compute a proper md5 (or whichever our default hash type will be) that is compatible with local workspaces and across workspaces. Things like etags could then be used the same way we use inode/mtime/size in the local state db: to cache the computed hashes for files/dirs.

E.g. a gs --external user: https://discord.com/channels/485586884165107732/485596304961962003/817175897488621588
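A minimal sketch of that caching idea outside of DVC, with a hypothetical bucket/key and plain AWS CLI + md5sum: the expensive md5 is computed once and reused for as long as the object's etag is unchanged, just like inode/mtime/size guard entries in the local state db.

# fetch the current etag without downloading the object:
etag=$(aws s3api head-object --bucket mybucket --key data.csv --query ETag --output text)
if [ "$etag" = "$(cat .state_etag 2>/dev/null)" ]; then
    md5=$(cat .state_md5)    # cache hit: no download, no re-hashing
else
    # etag changed (or no state yet): stream the object and hash it properly
    md5=$(aws s3 cp s3://mybucket/data.csv - | md5sum | cut -d' ' -f1)
    printf '%s\n' "$etag" > .state_etag
    printf '%s\n' "$md5" > .state_md5
fi
echo "md5: $md5"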

@efiop (Contributor, Author) commented Sep 30, 2021

For the record: we are being forced to formalize this, at least internally for data management. No product decisions yet, though.
