Replies: 1 comment
-
@vanossj hi! DVC de-duplicates files by their content. So, let's say you have two directories, one with 6K files and another with the same 6K plus 2K on top, and you track both with DVC: the shared 6K files are stored only once. More on how it is stored can be found here.

So, to answer your question: you could just create two versions of the dataset in the same data registry, and Tom and Mary could use them independently in their projects.

A good question is how they can pick the "right" 6K and 8K in the first place, though. This is not something DVC can help with at the moment, but we'll have it soon as part of the upcoming DVCx. It would be great to run it by you. If you have time, let's jump on a call and discuss it. It would be great to learn about your workflow, and hopefully we can learn from each other. Ping me at ivan @ iterative.ai.
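To make the de-duplication point concrete, here is a minimal toy sketch of content-addressed storage in plain Python. It is not DVC's code or its cache layout: the `store` and `demo` helpers, the in-memory `cache` dict, and the choice of SHA-256 are all assumptions made for illustration. It only shows the idea that two overlapping directories cost as much storage as their unique files.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return a hex digest of the file's bytes (SHA-256 chosen for this sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def store(directory: Path, cache: dict[str, bytes]) -> list[str]:
    """Add every file in `directory` to `cache`, keyed by content hash.

    Returns the list of hashes that make up this version of the dataset.
    """
    manifest = []
    for path in sorted(directory.iterdir()):
        digest = content_hash(path)
        # setdefault keeps the first copy, so identical content is stored once
        cache.setdefault(digest, path.read_bytes())
        manifest.append(digest)
    return manifest


def demo(n_small: int = 6, n_extra: int = 2) -> tuple[int, int, int]:
    """Build two toy datasets; return (files in small, files in large, unique blobs)."""
    cache: dict[str, bytes] = {}
    with tempfile.TemporaryDirectory() as tmp:
        small = Path(tmp) / "small"
        large = Path(tmp) / "large"
        small.mkdir()
        large.mkdir()
        for i in range(n_small + n_extra):
            data = f"image-{i}".encode()
            (large / f"{i}.bin").write_bytes(data)
            if i < n_small:
                # the small dataset is a strict subset of the large one
                (small / f"{i}.bin").write_bytes(data)
        m_small = store(small, cache)
        m_large = store(large, cache)
    return len(m_small), len(m_large), len(cache)


if __name__ == "__main__":
    print(demo())  # (6, 8, 8)
```

In this toy version, each dataset version is just a list of hashes, and the 14 file references across both versions resolve to 8 stored blobs.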
-
Is there a way to have cohorts of data without duplication?
Here is a scenario:
There is a dataset of 10k images on a shared development server. Maybe in a data registry?
Tom wants to train a model on an 8k image subset, and Mary wants to train a model on a 6k image subset.
Is there a way to specify which 8k and which 6k images each researcher is referring to from the full 10k dataset? Or would the 8k and 6k 'cohorts' be different data registries?