Specify what should be pushed to different remotes #2095
Comments
Sounds like this is not about file types, but a need to specify what to push in general. What if you have files of the same type, some with sensitive data and others without? Right now you have the nice property that the latter happen to be models and the former something else, but that won't always be the case.
Yep, very true @Suor - that would be even better!
so, add a flag per output, something like …
It seems we hear this feature request often on the chat, and there's even a very similar SO question: How to use different remotes for different folders? The current solution is to set up the remotes and then use `dvc push` / `dvc pull` with the `-r <remote>` option and explicit targets (see the sketch below). As suggested in this issue, though, it would be useful to set up a map from certain outputs/directories/file types to specific default remotes. For example, we could add a …
Mapping could be done two ways: …
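To spell out the current workaround mentioned above, here is a minimal sketch; the remote names, URLs, and targets are placeholders, not taken from any real setup:

```sh
# Set up two remotes (names and URLs are placeholders)
dvc remote add models-remote s3://example-bucket/models
dvc remote add data-remote   s3://example-bucket/data

# Push/pull only specific targets to/from the remote you choose
dvc push -r models-remote models/model.pkl.dvc
dvc pull -r data-remote   data/raw.dvc
```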
@nirtiac for data from different storages or with different access permissions, you can use the new DVC functionality - …
Technically, this solution looks more or less the same as the proposed solution with output types (@shcheklein) or a working solution with specified remotes (@jorgeorpinel), but it provides a higher-level abstraction (dataset repo, ML model repo) and should be a better choice for users.
This is great, thank you @dmpetrov!
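Assuming the functionality referred to above is the data-registry pattern (separate dataset and model repos consumed via `dvc import`), a minimal sketch could look like this; the repo URLs and paths are assumptions:

```sh
# In the project that consumes the data (repo URLs and paths are hypothetical)
dvc import git@github.com:org/dataset-registry data/images -o data/images
dvc import git@github.com:org/model-registry models/classifier.pkl -o models/classifier.pkl

# Each imported artifact is fetched via the remote configured in its own
# source repo, so access control stays per-repository.
dvc update data/images.dvc   # refresh to the latest version in the registry
```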
I'm after something similar. I need to train a model on sensitive data. I was intending to split the training into two phases:
There would be two remotes:
When training, the developer would specify which data directory to use. Does this seem reasonable? I guess I would need to specify the remote when pulling and pushing the data directories, as @jorgeorpinel said?
@z0u your use case could be fulfilled using …
If it's of any help, here is a bash script showing how that could look: …
It might seem intimidating at first sight, but note that most of the script code is there to create repos and navigate around git commits. [EDIT]
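The script itself isn't reproduced above; as a rough sketch of the two-remote split for this use case, with every remote name, bucket, and path assumed rather than taken from the original script:

```sh
# Restricted remote for the sensitive training data,
# open remote for the derived/pretrained artifacts
dvc remote add sensitive-remote s3://restricted-bucket/data
dvc remote add shared-remote    s3://shared-bucket/artifacts

# Phase 1 machine: push each directory only to its intended remote
dvc push -r sensitive-remote data/sensitive.dvc
dvc push -r shared-remote    models/pretrained.dvc

# Phase 2 developer (no access to the restricted bucket) pulls only
# the second-phase inputs:
dvc pull -r shared-remote models/pretrained.dvc
```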
I have a somewhat related problem now, where I want to organize a data registry that contains both public and private datasets and would be used by multiple devs. Private datasets contain sensitive information and should not be accessible to devs who don't need them, while public datasets are accessible to everyone in the company. Every private dataset is different and independent, so it is a valid scenario for someone to have access to PrivateDatasetA but not PrivateDatasetB, and vice versa.
If it were possible to apply a default remote by mask, or to bind remotes to .dvc files, the structure could be simplified to a single DVC project with different remotes for different datasets.
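To make the request concrete, one hypothetical shape it could take is sketched below; neither the per-output `remote:` field nor the mask-based mapping is claimed to be existing DVC syntax, they only illustrate the proposal:

```sh
# Hypothetical: bind a remote directly to a dataset's .dvc file (illustration only)
cat datasets/private_a.dvc
# outs:
# - md5: 1a2b3c4d...
#   path: private_a
#   remote: private-remote-a    # proposed per-output default remote

# Hypothetical: map datasets to remotes by path mask in .dvc/config (illustration only)
# [remote-map]
#     "datasets/private_a/**" = private-remote-a
#     "datasets/public/**"    = public-remote
```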
I need to be able to push versioned trained models to a production server. Both data and models will be created on one machine and versioned using DVC. However, I have sensitive information in my training server data that must never be pushed off the server.
As it stands, we are figuring out how to push models to a remote in a way that ties them back to their versioned data/code within DVC while never also pushing the data; this currently requires a solution outside of DVC. It would be great if DVC could provide the capability to specify which file types can or can't be pushed to a remote once committed.
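Until something like that exists, a hedged sketch of the manual split (the remote name, URL, and model path are assumptions):

```sh
# Production remote receives only model artifacts (name/URL are placeholders)
dvc remote add prod-models s3://prod-bucket/models

# Push only the versioned model target; the sensitive data is tracked in the
# same repo but is simply never pushed to this remote
dvc push -r prod-models models/model.pt.dvc
```

Because the model's .dvc file (with its hash) is committed to Git alongside the code and data .dvc files, the pushed model can still be traced back to the exact data/code version, even though the data blobs never leave the training server.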