
external workspaces #3920

Open · efiop opened this issue May 31, 2020 · 6 comments
Labels: discussion (requires active participation to reach a conclusion) · feature request (Requesting a new feature)

Comments

@efiop (Contributor) commented May 31, 2020

We currently support a so-called "external outputs" scenario, which is based on creating a separate external cache location for each type of output (e.g. s3, ssh, etc). This scenario is unpolished, outright broken in some cases, and is also constantly misused with the intention of importing files from remote locations into the local cache/workspace or a remote. People often don't realize that this effectively extends your workspace outside of your local repo and needs proper isolation, so that you don't run into a conflict with your colleague running dvc checkout while you are working on a file/dir in the external workspace.

This makes me think that we need to introduce proper terminology and an abstraction that clearly states what this is all about and makes this powerful feature usable. The solution is to introduce the concept of "workspaces". It could look something like this:

*** DRAFT ***

  1. Define the external workspace you want to attach:
dvc workspace add myssh ssh://example.com/home/efiop

Now, unless explicitly configured otherwise, dvc will assume that you want to use ssh://example.com/home/efiop/.dvc/cache as the default cache for artifacts in that workspace, similar to your local repo.
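For example, the implied default could be overridden per workspace. A sketch, assuming a hypothetical modify subcommand that mirrors the existing dvc remote modify pattern:

# implied default, assumed automatically:
#   cache = ssh://example.com/home/efiop/.dvc/cache
# hypothetical override for a non-standard cache location:
dvc workspace modify myssh cache ssh://example.com/scratch/dvc-cache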

  2. Use your workspace:
dvc add ssh://example.com/home/efiop/data
dvc run -o ssh://example.com/home/efiop/model.pkl ...

or with a special workspace notation (similar to the so-called remote notation that we currently have: remote://myremote/path):

dvc add ws://myssh/data
dvc run -o ws://myssh/model.pkl

This notation is nice because it allows you to redefine your workspaces (e.g. for each coworker to use their own home directory on the server) pretty easily (plus the config options can be set once in the config section for that workspace).
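For example (hypothetical syntax, reusing the existing --local config convention so that per-user overrides stay out of the shared repo config):

# Alice points the same workspace name at her own home directory:
dvc workspace modify myssh url ssh://example.com/home/alice --local
# the same ws:// paths and .dvc files now resolve to her directory:
dvc add ws://myssh/data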

Current cache.local/ssh/s3/gs/etc sections will be removed, because it is wrong that we operate based on the cache scheme rather than on the workspace we are working in.

CC @PeterFogh , would appreciate your thoughts on this 🙂

@shcheklein (Member):
> Current cache.local/ssh/s3/gs/etc sections will be removed, because it is wrong that we operate based on the cache scheme rather than on the workspace we are working in.

Does it mean that we'll have to duplicate files on s3 (or another cloud) on dvc push? Will it copy data from s3://something/.dvc/cache to s3://remote-storage?

> dvc run -o ssh://example.com/home/efiop/model.pkl

This notation does not solve one of the main problems - workspace isolation - so we'll have to use the ws:// one, similar to remote://, which kind of defeats the point, unless I'm missing something?


I would say that ideally what we want is something like:

dvc workspace add projectA-on-ssh ssh://example.com/home/efiop/projectA

now I should be able to do:

dvc run -d file

and if file does not exist locally, it should try to use the remote one.

Ideally, the same should happen with code: if we pass file and the code tries to read it, we try to resolve it properly.
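A sketch of that resolution flow (behavior entirely hypothetical):

dvc workspace add projectA-on-ssh ssh://example.com/home/efiop/projectA
dvc run -d file ...
# 'file' is absent locally, so dvc would fall back to
# ssh://example.com/home/efiop/projectA/file for the dependency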

@efiop (Contributor, Author) commented Jun 1, 2020

> Does it mean that we'll have to duplicate files on s3 (or another cloud) on dvc push? Will it copy data from s3://something/.dvc/cache to s3://remote-storage?

Not necessarily. As you know, push/pull currently doesn't affect external outs at all. I think we will be able to configure an external shared cache there too, one that your local dvc push would push to, if you need it set up that way.
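Something along these lines, for example (the workspace subcommand and its behavior are hypothetical):

# hypothetical: give the workspace a shared external cache that a local
# `dvc push` would also fill, instead of copying the whole workspace cache:
dvc workspace modify mys3 cache s3://shared-bucket/dvc-cache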

> I would say that ideally what we want is something like:

That seems very implicit, but there is definitely potential there. Though I'm not sure how we could isolate the code in such a way.

@PeterFogh (Contributor):
Hi @efiop,
It has been a long time since I have contributed to the DVC community, mainly because the data science team I'm part of settled on a stable setup for performing ML experiments using DVC version 0.84, after you released it as a Conda package. Hopefully we'll get time over the summer to update to your new version 1 and see which new features we can benefit from :)

Our setup still works great, as we all have individual "workspaces" on a single remote machine while sharing the cache on that same remote machine. The setup is still the same as described here: https://github.com/PeterFogh/dvc_dask_use_case, where the global DVC configuration provides the path to the individual workspaces.
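For reference, the shared-cache half of that setup is already expressible with today's config options, assuming the shared cache directory is accessible from every user's account on the machine:

# real, existing DVC config: point every user's repo at one shared cache dir
dvc config cache.dir /mnt/shared/dvc-cache --global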

I think your proposal of getting DVC to better support workspaces on remote machines and remote data storage is important. The biggest advantage of DVC is using it to manage large pipelines with lots of data (i.e. > 5GB), for example, processing data from multiple sources and combining it into model features and model training/evaluation. I do not want to process or store this amount of data on my local machine; thus, in my opinion, DVC should mainly focus on the management of data and process execution on remote machines.

DVC should also support isolating such data and process execution to individual machines in different locations, meaning it should be possible to have:

  • the DVC cache located on one storage machine (let us call it C),
  • the individual workspaces located on separate storage machines (let us call them W1 to Wn, where "n" is the number of team members), and
  • finally, individual machines (let us call them P1 to Pn) for process execution for each workspace. The process execution machines should, of course, have access to the dedicated workspace storage machine.

The setup I have "tried" to describe here will probably be the next step for my team's DVC setup, as sketched below.
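In the proposal's terms, that separation might look something like this (syntax entirely hypothetical; the execution machines P1 to Pn are outside the scope of path abstraction):

# workspace for team member 1, on storage machine W1:
dvc workspace add w1 ssh://w1.example.com/home/member1/project
# shared cache on storage machine C:
dvc workspace modify w1 cache ssh://c.example.com/dvc-cache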

@efiop, I like your proposal of the DVC "workspace", as I assume it will simplify my setup (meaning https://github.com/PeterFogh/dvc_dask_use_case). However, as I understand it, it will not make DVC more "flexible" or add more functionality than my current setup has. Is that correctly understood?

I think it is important that the DVC "workspace" incorporates a way to isolate the DVC cache, the DVC workspaces, and process execution to individual machines in different locations. @efiop, is it possible that it could support that?

@efiop (Contributor, Author) commented Jul 4, 2020

> However, as I understand it, it will not make DVC more "flexible" or add more functionality than my current setup has. Is that correctly understood?

Correct. It should mostly prevent the misuse and enforce isolation in some sense.

> I think it is important that the DVC "workspace" incorporates a way to isolate the DVC cache, the DVC workspaces, and process execution to individual machines in different locations. @efiop, is it possible that it could support that?

So you mean abstracting not only the paths, as we do now, but also the executors? Like being able to launch your stage locally or, say, on some ec2 machine or just another server? If so, we are also now actively looking into that (though only for the local scenario) in #2799.

@efiop (Contributor, Author) commented Mar 4, 2021

For the record: we also have the problem that we currently use unsuitable checksums as hashes, which makes it impossible to push/pull/etc external data. So we might consider sacrificing more time when adding data in order to compute a proper md5 (or whichever our default hash type will be) that is compatible with local workspaces and across workspaces. Things like etags could then be used the same way we use inode/mtime/size in the local state db: to cache the computed hashes for files/dirs.

E.g. a gs --external user: https://discord.com/channels/485586884165107732/485596304961962003/817175897488621588
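A minimal sketch of that caching idea outside of DVC, with a hypothetical bucket/key and plain AWS CLI + md5sum: the expensive md5 is computed once and reused for as long as the object's etag is unchanged, just like inode/mtime/size guard entries in the local state db.

# fetch the current etag without downloading the object:
etag=$(aws s3api head-object --bucket mybucket --key data.csv --query ETag --output text)
if [ "$etag" = "$(cat .state_etag 2>/dev/null)" ]; then
    md5=$(cat .state_md5)    # cache hit: no download, no re-hashing
else
    # etag changed (or no state yet): stream the object and hash it properly
    md5=$(aws s3 cp s3://mybucket/data.csv - | md5sum | cut -d' ' -f1)
    printf '%s\n' "$etag" > .state_etag
    printf '%s\n' "$md5" > .state_md5
fi
echo "md5: $md5"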

@efiop (Contributor, Author) commented Sep 30, 2021

For the record: we are being forced to formalize this, at least internally for data management. No product decisions yet, though.
