Data cohorts #10377

vanossj · 2024-04-01T21:47:30Z

Is there a way to have cohorts of data without duplication?

Here is a scenario:

There is a dataset of 10k images on a shared development server. Maybe in a data registry?

Tom wants to train a model on an 8k image subset, and Mary wants to train a model on a 6k image subset.

Is there a way to specify which 8k and which 6k images each researcher is referring to from the full 10k dataset? Or would the 8k and 6k 'cohorts' be different data registries?

shcheklein · 2024-04-03T00:52:44Z

Maintainer

@vanossj hi! DVC de-duplicates files by their content. Say you have two directories: one with 6K files, and another with the same 6K plus 2K more on top. If you run dvc add dir1 and dvc add dir2, DVC won't save the first 6K twice; it will "know" that they already exist.

More on how it is stored can be found here.
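To make the de-duplication idea concrete, here is a minimal sketch of a content-addressed store in Python. It is not DVC's actual implementation: the add_dir and demo helpers, the flat cache layout, and the use of SHA-256 are all assumptions made for illustration. It only shows why adding a 6K directory and then a 6K + 2K directory stores each distinct file once.

```python
import hashlib
import tempfile
from pathlib import Path


def add_dir(src: Path, cache: Path) -> dict[str, str]:
    """Hypothetical helper: store each file of src in cache under its content hash.

    Returns a manifest mapping relative file path -> content hash.
    """
    cache.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for path in sorted(p for p in src.rglob("*") if p.is_file()):
        data = path.read_bytes()
        # SHA-256 is a stand-in here; the hash DVC uses is not assumed.
        digest = hashlib.sha256(data).hexdigest()
        target = cache / digest
        if not target.exists():  # identical content is written only once
            target.write_bytes(data)
        manifest[path.relative_to(src).as_posix()] = digest
    return manifest


def demo() -> tuple[int, int, int]:
    """Scaled-down scenario: dir1 has 6 files, dir2 has the same 6 plus 2 more."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        dir1, dir2, cache = root / "dir1", root / "dir2", root / "cache"
        dir1.mkdir()
        dir2.mkdir()
        for i in range(6):
            content = f"image-{i}".encode()
            (dir1 / f"img{i}.jpg").write_bytes(content)
            (dir2 / f"img{i}.jpg").write_bytes(content)
        for i in range(6, 8):
            (dir2 / f"img{i}.jpg").write_bytes(f"image-{i}".encode())
        m1 = add_dir(dir1, cache)
        m2 = add_dir(dir2, cache)
        # Files tracked in dir1, files tracked in dir2, objects actually stored.
        return len(m1), len(m2), len(list(cache.iterdir()))
```

In this toy version each directory keeps its own manifest (6 and 8 entries), while the shared cache holds only the 8 distinct files, so the overlap costs no extra storage.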

So, to answer your question: you could just create two versions of the dataset in the same data registry, and Tom and Mary could use them independently in their projects.

A good question is how they can pick the "right" 6K and 8K in the first place, though. This is not something DVC can help with at the moment, but we'll have it soon as part of the upcoming DVCx. It would be great to run it by you. If you have time, let's jump on a call and discuss it. It would be great to learn about your workflow, and hopefully we can learn from each other. Ping me at ivan @ iterative.ai .
