Specify what should be pushed to different remotes #2095

Closed
nirtiac opened this issue Jun 5, 2019 · 12 comments

Labels: feature request, p2-medium, research

Comments


nirtiac commented Jun 5, 2019

I need to be able to push versioned trained models to a production server. Both data and models will be created on one machine and versioned using DVC. However, I have sensitive information in my training server data that must never be pushed off the server.

As it stands, we are figuring out how to push models to a remote in a way that can be tied back to their versioned data/code within DVC, while never also pushing the data. This currently requires a solution outside of DVC. It would be great if DVC could provide the capability to specify which file types can or can't be pushed to a remote once committed.

@efiop added the feature request label Jun 5, 2019

Suor commented Jun 11, 2019

Sounds like this is not about file types, but about a need to specify what to push in general. What if you have files of the same type, some with sensitive data and others without? You have the nice property here that the latter are models and the former are something else, but that won't always be the case.


nirtiac commented Jun 11, 2019

Yep, very true @Suor - that would be even better!

@shcheklein

So, add a flag per output, something like --out-local, to avoid pushing it to a remote at all? It should be pretty straightforward. My only concern is that the number of different options for different types of outputs doubles every time. We should think about some general mechanism to specify options per output.

@efiop added the p4-not-important and p3-nice-to-have labels and removed the p4-not-important label Jul 22, 2019
@jorgeorpinel added the p2-medium label and removed the p3-nice-to-have label Nov 20, 2019
@jorgeorpinel changed the title from "Specify file types that can be pushed to remote" to "Specify what should be pushed to different remotes" Nov 20, 2019
@jorgeorpinel

It seems we hear this feature request often on the chat, and there's even a very similar SO question, "How to use different remotes for different folders?": "I want my data and models stored in separate Google Cloud buckets. The idea is that I want to be able to share the data with others without sharing the models."

The current solution is to set up the remotes, and then use dvc push -r {REMOTE} {data/model} individually for each data/model file.
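
A minimal sketch of that workaround, assuming two remotes and illustrative target names (a data/ directory tracked by data.dvc, a model.pkl tracked by model.pkl.dvc):

# one-time setup: register both remotes
dvc remote add -d data-remote s3://my-data-bucket    # default remote
dvc remote add model-remote s3://my-model-bucket

# push each target only to the remote it belongs to
dvc push -r data-remote data
dvc push -r model-remote model.pkl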

As suggested in this issue though, it would be useful to set up a mapping from certain outputs/directories/file types to specific default remotes. For example, we could add a --out option to dvc remote default (somehow recorded in .dvc/config) so DVC knows that a given output file should be pushed to a specific remote by default (unless overridden with dvc push -r, of course). The option could be called something else like --map and accept not only outputs but any file/dir path, or even a file type filter (e.g. something like **/*.csv).


Suor commented Nov 20, 2019

Mapping could be done two ways:

  • specify outputs, possibly patterns, in a remote config
  • specify remote in an output definition in a stage file
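
Neither of these exists yet, so the following is a purely hypothetical sketch of what the two options could look like (all option and field names invented for illustration):

# (1) hypothetical: a pattern in the remote config (.dvc/config)
#     ['remote "models"']
#         url = s3://my-model-bucket
#         outs = models/**          <- invented option
#
# (2) hypothetical: a remote named in the output definition of a stage/.dvc file
#     outs:
#     - path: model.pkl
#       remote: models              <- invented field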

@dmpetrov

@nirtiac for data from different storages or with different access permissions, you can use the new DVC functionality dvc import. It works only in a multi-repository environment (not a monorepo), but it potentially gives you an elegant solution for separating data remotes.

  1. The dataset can live in a separate Git repo with its own remote. Repo: https://github.com/mycorp/rawdata. Remote: something like s3://rawdata-bucket.
  2. The modeling code with your ML model lives in another repo like https://github.com/mycorp/predict. Remote: something like s3://predict-bucket.
  3. You just import the dataset from your model repo: dvc import https://github.com/mycorp/rawdata images.
  4. Now the model repo has data from different data remotes. dvc push from the model repo won't push the imported dataset into the model remote (see the sketch below this list).
  5. You can share only your model repo predict with the ML model consumers (a prod environment, for example), plus an access key for the model bucket s3://predict-bucket. This prevents consumers from accessing the source dataset in s3://rawdata-bucket.
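
A condensed sketch of that flow from the model repo's side (repo names and buckets as in the list above; model.pkl is an illustrative output name):

# inside the predict repo, whose own default remote is s3://predict-bucket
dvc import https://github.com/mycorp/rawdata images   # data comes from s3://rawdata-bucket
git add images.dvc .gitignore
git commit -m "Import raw images from the data registry"

# train and track the model, then push: only the model goes to s3://predict-bucket
dvc add model.pkl
git add model.pkl.dvc .gitignore
git commit -m "Track model"
dvc push   # the imported dataset is not pushed to the model remote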

Technically, this solution looks more or less the same as the proposed solution with output types (@shcheklein) or the working solution with specified remotes (@jorgeorpinel), but it provides a higher-level abstraction (dataset repo, ML model repo) and should be a better choice for users.


nirtiac commented Dec 9, 2019

This is great, thank you @dmpetrov !


z0u commented Jan 7, 2020

I'm after something similar. I need to train a model on sensitive data. I was intending to split the training into two phases:

  1. Dev: fast iteration on synthetic/scrubbed data. Training would be done on a developer's machine. The model could be served locally and used for contract tests.

  2. Prod: final model trained on real data. Training would happen in a secure environment.

There would be two remotes: data-dev and data-prod (e.g. two S3 buckets). To prevent cross-pollination, only one of the remotes should be used from each environment. So maybe a project structured like this:

foo-training/ -> [email protected]/foo-training.git
    data-dev/ -> s3://data-dev/foo/
        training-examples.dvc
        model.dvc
    data-prod/ -> s3://data-prod/foo/
        training-examples.dvc
        model.dvc
    src/...

When training, the developer would specify which data directory to use. Does this seem reasonable? I guess I would need to specify the remote when pulling and pushing the data directories, as @jorgeorpinel said?
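
If I'm reading that workaround right, for the layout above it would be something like this (standard dvc remote add / dvc pull -r / dvc push -r usage, with names taken from the sketch):

# register both remotes once
dvc remote add data-dev s3://data-dev/foo
dvc remote add data-prod s3://data-prod/foo

# on a developer machine: only ever touch the dev remote
dvc pull -r data-dev data-dev/training-examples.dvc
dvc push -r data-dev data-dev/model.dvc

# in the secure environment: only ever touch the prod remote
dvc pull -r data-prod data-prod/training-examples.dvc
dvc push -r data-prod data-prod/model.dvc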


pared commented Jan 7, 2020

@z0u your use case could be fulfilled using the dvc import-url command.
Such a workflow could be described as follows:

  1. Create dev_repo and use dvc import-url to obtain the development data
  2. Write the training code and commit to your git repo when finished
  3. [On prod env] Clone the repo and create a branch for prod training
  4. [On prod env] Override the data using dvc import-url again, this time importing the real data
  5. [On prod env] Run dvc repro to rerun all pipeline steps with the new data

If it is of any help, here is a bash script showing how that could look:

#!/bin/bash

rm -rf dev_repo storage data_dev data_prod git_repo prod_repo
mkdir dev_repo storage data_dev data_prod git_repo

echo data_development >> data_dev/data_file_dev
echo data_production >> data_prod/data_file_prod
maindir=$(pwd)

# create empty repo
pushd git_repo
git init --bare --quiet
popd

# create dev environment
pushd dev_repo
git init --quiet && dvc init --quiet
git remote add origin $maindir/git_repo

git add -A
git commit -am "dvc init"
git push origin master

#import dev_data and create pipeline using dev_data
dvc import-url $maindir/data_dev data
git add data.dvc .gitignore
git commit -am "add data"

# write code
echo -e "import os\nprint(os.listdir('data'))\nwith open('pipeline_result', 'w') as fd:\n    fd.write(str(os.listdir('data')))">>code.py

# create dvc pipeline
dvc run -d data -d code.py -o pipeline_result python code.py

#commit progress
git add pipeline_result.dvc .gitignore code.py
git commit -am "write code, run pipeline"
git push origin master

echo '#########################'
echo 'Pipeline result content:'
cat pipeline_result
echo -e '\n#########################'

popd

# clone git repository to safe production environment
git clone git_repo prod_repo
pushd prod_repo

# create branch to use production data
git checkout -b with_prod_data

# note that we are overriding data with production data
dvc import-url $maindir/data_prod data
git commit -am "import prod data"

# rerun whole pipeline to make use of new data
dvc repro pipeline_result.dvc
git commit -am "retrained with prod data"
git push origin with_prod_data

echo '#########################'
echo 'Pipeline result content:'
cat pipeline_result
echo -e '\n#########################'

It might seem intimidating at first sight, but note that most of the script is there to create repos and navigate around git commits.
The most important parts are dvc import-url $maindir/data_prod data, where we re-import the data folder (now with production data), and dvc repro pipeline_result.dvc, where we rerun the pipeline with the new data.

[EDIT]
@z0u Also, please note @dmpetrov's comment above about creating a separate repository for data; that could also be useful if you need strict separation of prod/dev data.


zimka commented Apr 17, 2021

I have a somewhat related problem now, where I want to organize a data registry that contains both public and private datasets and would be used by multiple devs. Private datasets contain sensitive information that should not be accessible to devs without necessity, while public datasets are accessible to everyone in the company. Every private dataset is different and independent, so it is a valid scenario that someone has access to PrivateDatasetA but not to PrivateDatasetB, and vice versa. Essentially, I'd like to be able to grant devs read/write permissions in a many-to-many style.

This cannot be done out of the box AFAIK, and for now the most suitable workaround from my point of view is based on using multiple remote storages. Most remote types (such as S3, SSH, or local-network) provide some way to manage permissions, so it is possible to grant different remote permissions to different devs. The problem is that for this workaround different datasets (PublicDataset1.dvc and PrivateDatasetA.dvc) must have different remotes. Also, to use the whole repo as a data registry these remotes must be marked as default, which is not possible since there is only one default remote per repo. So for now either multiple subdir DVC projects in the same repo or multiple repos must be used. Here is an example structure of such a registry repo:

MyDvcRegistry/ (https://github.com/me/MyDvcRegistry)
--public/ (DvcProject, default remote --> s3://mypublicbucket)
----PublicDataset1.dvc, PublicDataset2.dvc
--private/
----PrivateDatasetA/ (DvcProject, default remote --> s3://myprivateA)
------PrivateDatasetA.dvc
----PrivateDatasetB/ (DvcProject, default remote --> s3://myprivateB)
------PrivateDatasetB.dvc

If there were a possibility to apply a default remote by mask, or to bind remotes to .dvc files, it would be possible to simplify this structure and have a single DVC project with different remotes for different datasets.
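
For completeness, the sub-project workaround above can be wired up with nested DVC projects, each with its own default remote. A sketch (dvc init --subdir creates a DVC project in a subdirectory of an existing Git repo; bucket names as in the layout; remote names are illustrative):

# one Git repo (MyDvcRegistry), several DVC sub-projects
cd public
dvc init --subdir
dvc remote add -d public-remote s3://mypublicbucket
dvc add PublicDataset1 PublicDataset2

cd ../private/PrivateDatasetA
dvc init --subdir
dvc remote add -d privateA-remote s3://myprivateA
# dvc add the dataset here; read/write access is then enforced by the bucket's own permissions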

@efiop self-assigned this Aug 16, 2021
efiop added a commit to efiop/dvc that referenced this issue Aug 24, 2021
efiop added a commit that referenced this issue Aug 27, 2021

johan-sightic commented Apr 19, 2022

@pared I (discord) would use this feature

@dberenbaum

Closing in favor of #7594 since the initial request was implemented in #6486.
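
For anyone landing here later: as far as I understand, the implemented mechanism lets you name a remote per output, roughly as below (syntax from memory, please check the current DVC docs):

# register a default remote plus a dedicated one for models
dvc remote add -d storage s3://shared-bucket
dvc remote add models s3://models-bucket

# dvc.yaml (excerpt), pinning one output to the non-default remote:
#   stages:
#     train:
#       cmd: python train.py
#       outs:
#         - model.pkl:
#             remote: models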
