Specify what should be pushed to different remotes #2095

Closed
nirtiac opened this issue Jun 5, 2019 · 12 comments

Labels: feature request, p2-medium, research

Comments


nirtiac commented Jun 5, 2019

I need to be able to push versioned trained models to a production server. Both data and models will be created on one machine and versioned using DVC. However, I have sensitive information in my training server data that must never be pushed off the server.

As it stands, we are figuring out how to push models to a remote in a way that can be tied back to their versioned data/code within DVC, while never also pushing the data. This currently requires a solution outside of DVC. It would be great if DVC could provide the capability to specify which file types can or can't be pushed to a remote once committed.

@efiop added the feature request label Jun 5, 2019

Suor commented Jun 11, 2019

Sounds like this is not about file types, but about a need to specify what to push in general. What if you have files of the same type, some with sensitive data and others without? You have the nice property here that the latter are models and the former are something else, but that won't always be the case.


nirtiac commented Jun 11, 2019

Yep, very true @Suor - that would be even better!

@shcheklein

So, add a flag per output, something like --out-local, to avoid pushing it to a remote at all? It should be pretty straightforward. My only concern is that the number of different options for different types of outputs doubles every time. We should think about some general mechanism to specify options per output.

@efiop added the p4-not-important and p3-nice-to-have labels and removed the p4-not-important label Jul 22, 2019
@jorgeorpinel added the p2-medium label and removed the p3-nice-to-have label Nov 20, 2019
@jorgeorpinel changed the title from "Specify file types that can be pushed to remote" to "Specify what should be pushed to different remotes" Nov 20, 2019
@jorgeorpinel

It seems we hear this feature request often on the chat, and there's even a very similar SO question, "How to use different remotes for different folders?": "I want my data and models stored in separate Google Cloud buckets. The idea is that I want to be able to share the data with others without sharing the models."

The current solution is to set up the remotes, and then use dvc push -r {REMOTE} {data/model} individually for each data/model file.
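
A minimal sketch of that workaround, assuming two remotes and illustrative target names (a data/ directory tracked by data.dvc, a model.pkl tracked by model.pkl.dvc):

# one-time setup: register both remotes
dvc remote add -d data-remote s3://my-data-bucket    # default remote
dvc remote add model-remote s3://my-model-bucket

# push each target only to the remote it belongs to
dvc push -r data-remote data
dvc push -r model-remote model.pkl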

As suggested in this issue though, it would be useful to set up a mapping from certain outputs/directories/file types to specific default remotes. For example, we could add a --out option to dvc remote default (somehow recorded in .dvc/config) so DVC knows that a given output file should be pushed to a specific remote by default (unless overridden with dvc push -r, of course). The option could be called something else like --map and accept not only outputs but any file/dir path, or even a file type filter (e.g. something like **/*.csv).


Suor commented Nov 20, 2019

Mapping could be done two ways:

  • specify outputs, possibly patterns, in a remote config
  • specify remote in an output definition in a stage file
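
Neither of these exists yet, so the following is a purely hypothetical sketch of what the two options could look like (all option and field names invented for illustration):

# (1) hypothetical: a pattern in the remote config (.dvc/config)
#     ['remote "models"']
#         url = s3://my-model-bucket
#         outs = models/**          <- invented option
#
# (2) hypothetical: a remote named in the output definition of a stage/.dvc file
#     outs:
#     - path: model.pkl
#       remote: models              <- invented field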

@dmpetrov

@nirtiac for data from different storages or with different access permissions, you can use the new DVC functionality dvc import. It works only in a multi-repository environment (not a monorepo), but it potentially gives you an elegant solution for separating data remotes.

  1. The dataset can live in a separate Git repo with its own remote. Repo: https://github.com/mycorp/rawdata. Remote: something like s3://rawdata-bucket.
  2. The modeling code with your ML model lives in another repo like https://github.com/mycorp/predict. Remote: something like s3://predict-bucket.
  3. You just import the dataset from your model repo: dvc import https://github.com/mycorp/rawdata images.
  4. Now the model repo has data from different data remotes. dvc push from the model repo won't push the imported dataset into the model remote (see the sketch below this list).
  5. You can share only your model repo predict with the ML model consumers (a prod environment, for example), plus an access key for the model bucket s3://predict-bucket. This prevents consumers from accessing the source dataset in s3://rawdata-bucket.
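
A condensed sketch of that flow from the model repo's side (repo names and buckets as in the list above; model.pkl is an illustrative output name):

# inside the predict repo, whose own default remote is s3://predict-bucket
dvc import https://github.com/mycorp/rawdata images   # data comes from s3://rawdata-bucket
git add images.dvc .gitignore
git commit -m "Import raw images from the data registry"

# train and track the model, then push: only the model goes to s3://predict-bucket
dvc add model.pkl
git add model.pkl.dvc .gitignore
git commit -m "Track model"
dvc push   # the imported dataset is not pushed to the model remote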

Technically, this solution looks more or less the same as the proposed solution with output types (@shcheklein) or the working solution with specified remotes (@jorgeorpinel), but it provides a higher-level abstraction (dataset repo, ML model repo) and should be a better choice for users.


nirtiac commented Dec 9, 2019

This is great, thank you @dmpetrov !


z0u commented Jan 7, 2020

I'm after something similar. I need to train a model on sensitive data. I was intending to split the training into two phases:

  1. Dev: fast iteration on synthetic/scrubbed data. Training would be done on a developer's machine. The model could be served locally and used for contract tests.

  2. Prod: final model trained on real data. Training would happen in a secure environment.

There would be two remotes: data-dev and data-prod (e.g. two S3 buckets). To prevent cross-pollination, only one of the remotes should be used from each environment. So maybe a project structured like this:

foo-training/ -> [email protected]/foo-training.git
    data-dev/ -> s3://data-dev/foo/
        training-examples.dvc
        model.dvc
    data-prod/ -> s3://data-prod/foo/
        training-examples.dvc
        model.dvc
    src/...

When training, the developer would specify which data directory to use. Does this seem reasonable? I guess I would need to specify the remote when pulling and pushing the data directories, as @jorgeorpinel said?
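
If I'm reading that workaround right, for the layout above it would be something like this (standard dvc remote add / dvc pull -r / dvc push -r usage, with names taken from the sketch):

# register both remotes once
dvc remote add data-dev s3://data-dev/foo
dvc remote add data-prod s3://data-prod/foo

# on a developer machine: only ever touch the dev remote
dvc pull -r data-dev data-dev/training-examples.dvc
dvc push -r data-dev data-dev/model.dvc

# in the secure environment: only ever touch the prod remote
dvc pull -r data-prod data-prod/training-examples.dvc
dvc push -r data-prod data-prod/model.dvc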


pared commented Jan 7, 2020

@z0u your use case could be fulfilled using the dvc import-url command.
Such a workflow could be described as follows:

  1. Create dev_repo and use dvc import-url to obtain the development data
  2. Write the training code and commit to your git repo when finished
  3. [On prod env] Clone the repo and create a branch for prod training
  4. [On prod env] Override the data using dvc import-url again, this time importing the real data
  5. [On prod env] Run dvc repro to rerun all pipeline steps with the new data

If it is of any help, here is a bash script showing how that could look:

#!/bin/bash

rm -rf dev_repo storage data_dev data_prod git_repo prod_repo
mkdir dev_repo storage data_dev data_prod git_repo

echo data_development >> data_dev/data_file_dev
echo data_production >> data_prod/data_file_prod
maindir=$(pwd)

# create empty repo
pushd git_repo
git init --bare --quiet
popd

# create dev environment
pushd dev_repo
git init --quiet && dvc init --quiet
git remote add origin $maindir/git_repo

git add -A
git commit -am "dvc init"
git push origin master

#import dev_data and create pipeline using dev_data
dvc import-url $maindir/data_dev data
git add data.dvc .gitignore
git commit -am "add data"

# write code
echo -e "import os\nprint(os.listdir('data'))\nwith open('pipeline_result', 'w') as fd:\n    fd.write(str(os.listdir('data')))">>code.py

# create dvc pipeline
dvc run -d data -d code.py -o pipeline_result python code.py

#commit progress
git add pipeline_result.dvc .gitignore code.py
git commit -am "write code, run pipeline"
git push origin master

echo '#########################'
echo 'Pipeline result content:'
cat pipeline_result
echo -e '\n#########################'

popd

# clone git repository to safe production environment
git clone git_repo prod_repo
pushd prod_repo

# create branch to use production data
git checkout -b with_prod_data

# note that we are overriding data with production data
dvc import-url $maindir/data_prod data
git commit -am "import prod data"

# rerun whole pipeline to make use of new data
dvc repro pipeline_result.dvc
git commit -am "retrained with prod data"
git push origin with_prod_data

echo '#########################'
echo 'Pipeline result content:'
cat pipeline_result
echo -e '\n#########################'

It might seem intimidating at first sight, but note that most of the script is there to create repos and navigate around git commits.
The most important parts are dvc import-url $maindir/data_prod data, where we re-import the data folder (now with production data), and dvc repro pipeline_result.dvc, where we rerun the pipeline with the new data.

[EDIT]
@z0u Also, please note @dmpetrov's comment above about creating a separate repository for data; that could also be useful if you need strict separation of prod/dev data.


zimka commented Apr 17, 2021

I have a somewhat related problem now, where I want to organize a data registry that contains both public and private datasets and would be used by multiple devs. Private datasets contain sensitive information that should not be accessible to devs without necessity, while public datasets are accessible to everyone in the company. Every private dataset is different and independent, so it is a valid scenario that someone has access to PrivateDatasetA but not to PrivateDatasetB, and vice versa. Essentially, I'd like to be able to grant devs read/write permissions in a many-to-many style.

This cannot be done out of the box AFAIK, and for now the most suitable workaround from my point of view is based on using multiple remote storages. Most remote types (such as S3, SSH, or local-network) provide some way to manage permissions, so it is possible to grant different remote permissions to different devs. The problem is that for this workaround different datasets (PublicDataset1.dvc and PrivateDatasetA.dvc) must have different remotes. Also, to use the whole repo as a data registry these remotes must be marked as default, which is not possible since there is only one default remote per repo. So for now either multiple subdir DVC projects in the same repo or multiple repos must be used. Here is an example structure of such a registry repo:

MyDvcRegistry/ (https://github.com/me/MyDvcRegistry)
--public/ (DvcProject, default remote --> s3://mypublicbucket)
----PublicDataset1.dvc, PublicDataset2.dvc
--private/
----PrivateDatasetA/ (DvcProject, default remote --> s3://myprivateA)
------PrivateDatasetA.dvc
----PrivateDatasetB/ (DvcProject, default remote --> s3://myprivateB)
------PrivateDatasetB.dvc

If there were a possibility to apply a default remote by mask, or to bind remotes to .dvc files, it would be possible to simplify this structure and have a single DVC project with different remotes for different datasets.
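
For completeness, the sub-project workaround above can be wired up with nested DVC projects, each with its own default remote. A sketch (dvc init --subdir creates a DVC project in a subdirectory of an existing Git repo; bucket names as in the layout; remote names are illustrative):

# one Git repo (MyDvcRegistry), several DVC sub-projects
cd public
dvc init --subdir
dvc remote add -d public-remote s3://mypublicbucket
dvc add PublicDataset1 PublicDataset2

cd ../private/PrivateDatasetA
dvc init --subdir
dvc remote add -d privateA-remote s3://myprivateA
# dvc add the dataset here; read/write access is then enforced by the bucket's own permissions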

@efiop self-assigned this Aug 16, 2021
efiop added a commit to efiop/dvc that referenced this issue Aug 24, 2021
efiop added a commit that referenced this issue Aug 27, 2021

johan-sightic commented Apr 19, 2022

@pared I (discord) would use this feature

@dberenbaum

Closing in favor of #7594 since the initial request was implemented in #6486.
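
For anyone landing here later: as far as I understand, the implemented mechanism lets you name a remote per output, roughly as below (syntax from memory, please check the current DVC docs):

# register a default remote plus a dedicated one for models
dvc remote add -d storage s3://shared-bucket
dvc remote add models s3://models-bucket

# dvc.yaml (excerpt), pinning one output to the non-default remote:
#   stages:
#     train:
#       cmd: python train.py
#       outs:
#         - model.pkl:
#             remote: models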
