dvc get/import support for subrepos #3369
Comments
Hi @ehutt
At first I thought this was essentially the same as #2349, but I see it's not about subdirectories but rather about Git subrepos that are DVC projects inside a parent Git repo (which is not a DVC project itself). Am I correct? Maybe if you can provide a sample file hierarchy, including .git and .dvc dirs, this could be cleared up.
I'm not sure you need separate DVC projects for each sub-dataset; you may be overengineering a little 🙂 You could first try the simple pattern explained in https://dvc.org/doc/use-cases/data-registries, which just tracks multiple directories in a single DVC repo. BTW, since your data is already in S3, you probably want to look into external dependencies and outputs.
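For illustration, here is a rough sketch of that single-registry pattern; the repo URL, bucket path, and directory names below are assumptions based on this thread, not a prescribed setup:

```bash
# One Git repo, one DVC project at its root, tracking several dataset directories.
git init datasets && cd datasets
dvc init
dvc remote add -d storage s3://datasets          # assumed bucket name
dvc add audio text                               # track each dataset directory
git add .dvc .gitignore audio.dvc text.dvc
git commit -m "Track audio and text datasets"
dvc push                                         # upload the data to S3

# From any other project, pull in just one dataset:
dvc import https://github.com/example/datasets audio
```

The consuming project then gets an import `.dvc` file recording which registry revision the data came from, so the link back to the registry is versioned too.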
Yep.
Hmm, I just noticed there's an entire conversation on this in https://discordapp.com/channels/485586884165107732/485596304961962003/679842782046584883
Hi @jorgeorpinel, thanks for the swift response.
That's me. @efiop recommended that I use subrepos that live in the same Git repo but are DVC-initialized independently. This will only be helpful to me if I can get/import data from a specific subrepo. The problem I face now is that there is no way to tell `dvc get`/`dvc import` which subrepo (and therefore which remote) to pull from.
To this point, if I keep all of the data in S3 and manage it as external dependencies/outputs, does this defeat the point of having a GitHub data registry?

For example, one of my use cases involves taking raw audio, passing it through an ASR model, and then saving the resulting transcripts. Those transcripts will then be used for downstream NLP tasks. I want to be able to track which audio and model version were used to generate the transcripts, and then link that tracked info to the results of any downstream NLP tasks. If I can do this without the data ever leaving S3, all the better. I just want to make sure the pipeline can be reproduced and the data are versioned.

I am building this data science/ML workflow from scratch and trying to figure out the best way forward to version our data/models/experiments for reproducibility and clarity. I have not settled on a single way to organize our S3 storage or data registry, so I am open to suggestions about what you have seen work in your experience. Thanks for your help!
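Just to sketch how that ASR step could be captured so the audio and model versions behind the transcripts are recorded; every path and script name here is a placeholder, not something from your setup:

```bash
# Record the transcript-generation step as a DVC stage:
# it depends on the raw audio and the ASR model, and outputs the transcripts.
dvc run -d data/audio \
        -d models/asr_model.pt \
        -o data/transcripts \
        python transcribe.py data/audio models/asr_model.pt data/transcripts
```

Downstream NLP stages can then declare `data/transcripts` as a dependency, so the whole chain can be reproduced with `dvc repro` and traced back to the exact inputs.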
Ahhh good point! 🙁 It's actually a common pattern for dataset registry repos to use several remotes without setting any one as the default, precisely so you don't accidentally pull/push to the wrong one. So I see this as a great enhancement request; let's see what the core team thinks: #3371
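For reference, a sketch of what that looks like: several remotes, none added with `-d` (default), so every push/pull names its target with `-r` (bucket paths here just mirror the example in this issue):

```bash
dvc remote add audio s3://datasets/audio
dvc remote add text  s3://datasets/text

dvc push -r audio audio.dvc     # send only the audio data to its remote
dvc pull -r text  text.dvc      # fetch only the text data from its remote
```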
It doesn't defeat the purpose, because you'd be augmenting an existing data repository with DVC's versioning and pipelining features. But please keep in mind that using external dependencies/outputs changes your workflow: DVC doesn't copy them to your local project or provide any interface for your scripts to access the files as if they were local. Your code will have to connect to S3 and stream or download the data before it can read or write it. Also note that external outputs need an external cache configured in the same location type, which is an extra setup step.
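As a hedged sketch of what that external-data workflow could look like (bucket paths are the ones from this thread, the script name is made up): the dependency and output never leave S3, and the script itself has to talk to S3, e.g. via boto3.

```bash
# Assumed extra setup for external outputs: a cache location on S3.
dvc remote add s3cache s3://datasets/cache
dvc config cache.s3 s3cache

# The stage's dependency and output are external (they stay in S3);
# transcribe_s3.py must read from and write to S3 itself.
dvc run -d s3://datasets/audio \
        -o s3://datasets/transcripts \
        python transcribe_s3.py s3://datasets/audio s3://datasets/transcripts
```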
Maybe downloading the datasets locally and adding them to DVC (one by one if needed; you can delete them locally once no longer needed, related: https://github.com//issues/3355#issuecomment-587554733) is the most straightforward way to build the dataset, avoiding the need for external-data shenanigans.
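Something like the following, assuming the AWS CLI is available and using the bucket layout from this thread; the exact commands are a sketch, not a recommendation:

```bash
# One dataset at a time: download, track, push, then drop the workspace copy.
aws s3 sync s3://datasets/audio audio/
dvc add audio
git add audio.dvc .gitignore
git commit -m "Track audio dataset in the registry"
dvc push                 # uploads the cached data to the DVC remote
rm -rf audio             # removes the workspace copy (the cache keeps it until
                         # garbage-collected); `dvc checkout audio.dvc` restores it
```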
Yes! This is exactly what I need. I noticed other people requesting this feature and thought you weren't planning to implement it. But simply allowing a remote to be specified (rather than requiring a default) when calling `dvc get`/`dvc import` would solve my problem.

We may eventually change the workflow to use external dependencies so as to avoid unnecessary copies of the data that's on S3 (for privacy/security reasons when working with customer data), but for now I think the data registry plus support for importing from other remotes is the most elegant solution. Any idea if/when you will add this feature? I will be eagerly awaiting any updates!
@ehutt please follow (and upvote with 👍) #3371 for updates on that 🙂 But indeed, it sounds like the team's inclination for the short term is to provide a way to specify the remote. The other WIP solution (probably the one that will be ready first) is that `dvc get`/`dvc import` will support subrepos directly.
This adds support for `get`/`list`/`api`/`import`/`ls` on a subrepo. It also fixes iterative#3180 by adding granular URL support for `dvc.api.get_url`. And, of course, it fixes iterative#3369.

Commits:
* support subrepos
* Add tests
* Fix tests, started to get lock_error
* Fix tests for windows
* try to fix mac builds
* Fix comments, re-add removed comments
* Fix test
* address review suggestions:
  * use `spy` instead of `wraps`
  * remove dvc fixture in a test
  * set `cache_types` on Git repo
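With this in place, a DVC-initialized subdirectory inside a plain Git repo can be addressed directly; a hedged example (the repo URL and paths are made up):

```bash
# audio/ is a Git subdirectory with its own .dvc/ (a DVC subrepo)
dvc list   https://github.com/example/datasets audio
dvc get    https://github.com/example/datasets audio/clips
dvc import https://github.com/example/datasets audio/clips
```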
UPDATE: See #3369 (comment) for the main purpose of this issue.
If I have a data registry monorepo with several subrepos inside it, each pointing to its own remote storage path, I want to be able to import and/or get data from a specific subrepo/remote location.
My situation: I have an S3 bucket, `s3://datasets/`, where I store multiple datasets, say `audio/` and `text/`. I want to track these data using DVC and a single, Git-based data registry. Then I want to be able to selectively push and pull/import data from S3 through this registry, for use in my other DVC projects.

So, I can make subrepos for `audio/` and `text/`, each initialized with its own DVC file and remote, and push data to S3 this way. Then, if I want to download only the audio data into a new project, I can run something like `dvc import [email protected]/datasets/audio` and it will automatically pull from the correct path in S3, corresponding to the default remote for the `audio` subrepo.

Thank you!
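For clarity, the layout assumed in this request looks roughly like this (file names are illustrative):

```
datasets/                    # Git monorepo; no DVC project at the root
├── .git/
├── audio/                   # DVC subrepo
│   ├── .dvc/                #   own config; default remote s3://datasets/audio
│   └── data.dvc
└── text/                    # DVC subrepo
    ├── .dvc/                #   own config; default remote s3://datasets/text
    └── data.dvc
```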