
dvc get/import support for subrepos #3369

Closed · ehutt opened this issue Feb 20, 2020 · 8 comments · Fixed by #4465

Labels: enhancement (Enhances DVC), p1-important (Important, aka current backlog of things to do)


ehutt commented Feb 20, 2020

UPDATE: See #3369 (comment) for the main purpose of this issue.


If I have a data registry monorepo with several subrepos inside it, each pointing to its own remote storage path, I want to be able to import and/or get data from a specific subrepo/remote location.

My situation: I have an s3 bucket, s3://datasets/ where I store multiple datasets, say audio/ and text/. I want to track these data using dvc and a single, git-based data registry. Then I want to be able to selectively push and pull/import data from s3 through this registry for use in my other dvc projects.

So, I can make subrepos for audio/ and text/, each initialized with its own DVC files and remote, and push data to s3 this way. Then, if I want to download only the audio data into a new project, I can run something like dvc import [email protected]/datasets/audio and it will automatically pull from the correct path in s3, corresponding to the default remote for the audio subrepo.
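A rough sketch of the layout in question (all directory and file names are illustrative):

```
datasets/              # one Git repo (the registry monorepo)
├── .git/
├── audio/             # DVC subrepo with its own config and remote
│   ├── .dvc/          #   -> remote: s3://datasets/audio
│   └── data.dvc
└── text/              # DVC subrepo with its own config and remote
    ├── .dvc/          #   -> remote: s3://datasets/text
    └── data.dvc
```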

Thank you!

@shcheklein shcheklein transferred this issue from iterative/dvc.org Feb 20, 2020
@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Feb 20, 2020

jorgeorpinel commented Feb 20, 2020

Hi @ehutt

I have a data registry monorepo with several subrepos inside it, each pointing to its own remote storage

At first I thought this was essentially the same as #2349, but I see it's not about subdirectories but rather about Git subrepos that are DVC projects inside a parent Git repo (which is not itself a DVC project). Am I correct? Maybe if you can provide a sample file hierarchy including .git and .dvc dirs, this could be cleared up.

I have an s3 bucket where I store multiple datasets, say audio/ and text/. I want to track these using dvc...
So, I can make subrepos for audio/ and text/, each initialized with its own dvc file and remote...

I'm not sure you need to use separate DVC projects for each sub-dataset. You may be overengineering a little 🙂 You could first try to follow the simple pattern explained in https://dvc.org/doc/use-cases/data-registries which just tracks multiple directories in a single DVC repo.
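For reference, a minimal sketch of that pattern, assuming a placeholder bucket path:

```sh
# single DVC repo acting as the registry; one default remote for everything
dvc remote add -d storage s3://datasets/dvc-storage
dvc add audio    # creates audio.dvc
dvc add text     # creates text.dvc
git add audio.dvc text.dvc .gitignore
git commit -m "Track audio and text datasets"
dvc push         # upload both datasets' data to the default remote
```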

BTW, since your data is already in S3, you probably want to dvc add the sub-datasets as external outputs in your data registry.

Then I want to be able to selectively push and pull/import data from s3 through this registry for use in my other dvc projects...
then I can run something like dvc import [email protected]/datasets/audio and it will automatically pull from the correct path in s3

Yep. dvc get or dvc import would be the commands for this. They take both a URL (the data registry repo) and a path (each data artifact, i.e. audio or text) as arguments, so you can be selective with that combination. (Pulling and pushing are for synchronizing data between a DVC project's cache and its remote storage.)
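For example (the registry URL is a placeholder):

```sh
# plain download, no DVC project required in the workspace:
dvc get https://github.com/example/dataset-registry audio

# or, inside a DVC project, import it and record provenance in audio.dvc:
dvc import https://github.com/example/dataset-registry audio
```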

@jorgeorpinel

Hmmm I just noticed there's an entire conversation on this in https://discordapp.com/channels/485586884165107732/485596304961962003/679842782046584883


jorgeorpinel commented Feb 20, 2020

All that said, even if a simple data registry (single DVC+Git repo with subdirs) solves @ehutt's problem, given SCM: allow multiple DVC repos inside single SCM repo #3257 (WIP), checking that dvc get and dvc import continue to work in this kind of DVC sub-project seems important as well.

@skshetry skshetry added enhancement Enhances DVC feature request Requesting a new feature labels Feb 20, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Feb 20, 2020
@efiop efiop added the p1-important Important, aka current backlog of things to do label Feb 20, 2020

ehutt commented Feb 20, 2020

Hi @jorgeorpinel thanks for the swift response.

Hmmm I just noticed there's an entire conversation on this in https://discordapp.com/channels/485586884165107732/485596304961962003/679842782046584883

That's me. @efiop recommended that I use subrepos that live in the same Git repo but are initialized independently with DVC. This will only be helpful to me if dvc get and dvc import support DVC subrepos, where there may be multiple DVC files, one for each dataset rather than one at the root.

The problem I face now is that dvc import and dvc get require a default remote to be defined, but my data registry contains multiple datasets, all stored in different paths of an s3 bucket. I do not want to import an entire s3 bucket whenever I start a new project, nor do I want to have a separate Git repo for every dataset. This is the dilemma.
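To make the setup concrete, a hedged sketch of the registry config that hits this (remote names are made up):

```sh
# one remote per dataset, none marked as default (no -d flag):
dvc remote add audio-remote s3://datasets/audio
dvc remote add text-remote s3://datasets/text
# dvc get/import against this repo then has no default remote to pull from
```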

BTW, since your data is already in S3, you probably want to dvc add the sub-datasets as external outputs in your data registry.

To this point: if I keep all of the data in s3 and manage it as external dependencies/outputs, does this defeat the point of having a GitHub data registry? For example, one of my use cases involves taking raw audio, passing it through an ASR model, and then saving the resulting transcripts. Those transcripts are then used for downstream NLP tasks. I want to be able to track which audio and model version were used to generate the transcripts, and then link that tracked info to the results of any downstream NLP tasks. If I can do this without the data ever leaving s3, all the better. I just want to make sure the pipeline can be reproduced and the data are versioned.
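As a hedged sketch (the script name and paths are hypothetical, and dvc run syntax for external outputs has changed across DVC versions), such a stage might look like:

```sh
# external dependency and output live in S3; only metadata lands in Git
dvc run -d s3://datasets/audio/raw \
        -d models/asr.model \
        -o s3://datasets/text/transcripts \
        python transcribe.py
```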

I am building this data science/ML workflow from scratch and am just trying to figure out the best way to effectively version our data/models/experiments for reproducibility and clarity. I have not settled on a single way to organize our s3 storage or data registry, so I am open to suggestions based on what you have seen work in your experience.

Thanks for your help


jorgeorpinel commented Feb 20, 2020

The problem I face now is that dvc import and dvc get require a default remote to be defined, but my data registry contains multiple datasets, all stored in different paths of an s3 bucket

Ahhh, good point! 🙁 It's actually a common pattern for dataset registry repos to use several remotes and not set any one as the default, so you don't accidentally pull/push to the wrong one. So I see this one as a great enhancement request; let's see what the core team thinks: #3371

if I keep all of the data in s3 and manage it as external dependencies/outputs, does this defeat the point of having a github data registry? ... If I can do this without the data ever leaving s3, then that is all the better.

It doesn't defeat the point because you're augmenting an existing data repository with DVC's versioning and pipelining features. But please keep in mind that using external dependencies/outputs changes your workflow: DVC doesn't copy them to your local project or provide any interface for your scripts to access the files as if they were local. Your code will have to connect to S3 and stream or download the data before it can read or write it.
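For instance, a stage command might have to shuttle the data itself, roughly like this (file names and script are hypothetical):

```sh
# fetch the external input, process it, upload the result back to S3
aws s3 cp s3://datasets/audio/raw/clip001.wav /tmp/clip001.wav
python transcribe.py /tmp/clip001.wav /tmp/transcript001.txt
aws s3 cp /tmp/transcript001.txt s3://datasets/text/transcripts/transcript001.txt
```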

Also note that external outputs specified to dvc run require an external cache set up in the same remote location (I think; double-checking on Discord).
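If I recall the docs correctly, that setup is along these lines (the bucket path is a placeholder):

```sh
# register an S3 location and use it as the external cache
dvc remote add s3cache s3://datasets/cache
dvc config cache.s3 s3cache
```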

I am building this data science/ml workflow from scratch...

Maybe downloading the datasets locally and adding them to DVC (one by one if needed; you can delete them locally once no longer needed, related: #3355 (comment)) is the most straightforward way to build the dataset and avoid the need for external-data shenanigans.
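A minimal sketch of that workflow, assuming the bucket layout above:

```sh
# download a dataset, track it, and push it to the registry's remote
aws s3 cp --recursive s3://datasets/audio ./audio
dvc add audio
git add audio.dvc .gitignore
git commit -m "Add audio dataset"
dvc push
# the local copy can be removed afterwards; `dvc pull audio.dvc` restores it
```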


ehutt commented Feb 21, 2020

Ahhh, good point! 🙁 It's actually a common pattern for dataset registry repos to use several remotes and not set any one as the default, so you don't accidentally pull/push to the wrong one. So I see this one as a great enhancement request; let's see what the core team thinks: #3371

Yes! This is exactly what I need. I noticed other people requesting this feature and thought you weren't planning to implement it. But simply allowing a remote to be specified (rather than requiring a default) when calling dvc import would solve all my problems.

We may eventually change the workflow to use external dependencies so as to avoid people making unnecessary copies of the data that's on s3 (for privacy/security reasons when working with customer data), but for now I think the data registry + support for importing from other remotes is the most elegant solution.

Any idea if/when you will add this feature? I will be eagerly awaiting any updates!

@jorgeorpinel

@ehutt please follow (and upvote with 👍) #3371 for updates on that 🙂

But indeed, it sounds like the team's inclination for the short term is to provide a --remote option for get and import. I see this as a workaround, but it should do the trick. (Please see/follow #2466)
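Purely hypothetical, since the flag does not exist yet, but usage would presumably look like:

```sh
# hypothetical: proposed --remote flag, not implemented as of this comment
dvc import --remote audio-remote https://github.com/example/dataset-registry audio
```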

The other WIP solution (probably the one that will be ready first) is that dvc init will soon have a --subdir option to let you create DVC projects in subdirectories of a parent DVC project/Git repo (see the #3257 PR). This would change your workflow more along the lines of the chat you had with Ruslan on Discord, though.
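With that option, the monorepo setup would be roughly (names illustrative):

```sh
# inside the parent Git repo, one DVC project per dataset directory
cd audio
dvc init --subdir
dvc remote add -d audio-remote s3://datasets/audio
```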

@pared pared self-assigned this Mar 6, 2020
@pared pared removed their assignment Mar 6, 2020
@skshetry skshetry self-assigned this Jul 2, 2020
@efiop efiop mentioned this issue Aug 4, 2020
skshetry added a commit to skshetry/dvc that referenced this issue Sep 2, 2020
This adds support for get/list/api/import/ls on a subrepo.

Also fixes iterative#3180 by adding granular URL support for `dvc.api.get_url`.
And, of course, it fixes iterative#3369.
efiop pushed a commit that referenced this issue Sep 3, 2020

* support subrepos
* Add tests
* Fix tests, started to get lock_error
* Fix tests for windows
* try to fix mac builds
* Fix comments, readd removed comments
* Fix test
* address review suggestions:
  * use `spy` instead of `wraps`
  * remove dvc fixture in a test
  * set `cache_types` on Git repo
@skshetry

@ehutt, get/import for subrepos is in the latest version, i.e. 1.7.0. Please let us know if there are any issues. 🙂
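For anyone landing here, a hedged example of what now works (the monorepo URL and paths are placeholders):

```sh
# paths inside a subrepo now resolve against that subrepo's own remote
dvc get https://github.com/example/datasets audio/data
dvc import https://github.com/example/datasets audio/data
```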
