diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md
index 453a40cbfe..be5a542328 100644
--- a/static/docs/use-cases/data-registry.md
+++ b/static/docs/use-cases/data-registry.md
@@ -1,24 +1,21 @@
 # Data Registry
 
 One of the main uses of DVC repositories is the
-[versioning of data and model files](/doc/use-cases/data-and-model-files-versioning).
-This is provided by commands such as `dvc add` and `dvc run`, that allow
-tracking of datasets or any other data artifacts.
-
-With the aim to enable reusability of these versioned artifacts between
-different projects, DVC also includes the `dvc get`, `dvc import`, and
-`dvc update` commands. This means that a project can depend on data from an
-external DVC project, similar to package management systems, but
-for data.
+[versioning of data and model files](/doc/use-cases/data-and-model-files-versioning),
+with commands such as `dvc add`. With the aim to enable reusability of these
+data artifacts between different projects, DVC also provides the
+`dvc get`, `dvc import`, and `dvc update` commands. This means that a project
+can depend on data from an external DVC project, **similar to
+package management systems, but for data**.
 
 Keeping this in mind, we could build a DVC project dedicated to
 tracking and versioning datasets (or any large data). This way we would have a
-repository with all the metadata and history of changes of the project's data.
-We could see who updated what, and when, use pull requests to update data (the
-same way we do with code). This is what we call a data registry, and it works as
-data management middleware between your ML project and cloud storage.
+repository with all the metadata and history of changes of different datasets.
+We could see who updated what, and when, and use pull requests to update data
+(the same way we do with code). This is what we call a **data registry**, which
+can work as data management _middleware_ between ML projects and cloud storage.
 
 Advantages of using a DVC **data registry** project:
 
@@ -44,9 +41,9 @@ Advantages of using a DVC **data registry** project:
 
 ## Building data registries
 
-A data registry is a kind of DVC repository, so it can be created
-locally like to any other Git + DVC project. However, the registry
-should be available online, so it must pushed to a Git server:
+Data registries are DVC repositories, so they can be created
+locally like any other Git + DVC project. However, registries
+should be available online i.e. pushed to a Git server. For example:
 
 ```dvc
 $ mkdir my-data-registry && cd my-data-registry
@@ -57,15 +54,16 @@ $ git branch -u origin/master
 $ git push
 ```
 
-What will make the online registry special, is that it will mainly contain
-[DVC-files](/doc/user-guide/dvc-file-format). These will track the different
-datasets we want to version. The actual data will be stored in one or more
-[remote storage](/doc/command-reference/remote) locations configured in the
-project.
+What makes online data registries special, is that they mainly contain simple
+[DVC-files](/doc/user-guide/dvc-file-format) (probably no source code or
+[pipelines](/doc/command-reference/pipeline)). These [DVC-files track the
+different datasets we may want to version. The actual data will be stored in one
+or more [remote storage](/doc/command-reference/remote) locations configured in
+the project.
 
 A good way to organize these DVC-files is in different directories that group
-the data artifacts for different uses, for example `images/`,
-`natural-language/`, etc. As an example, our
+the data into separate uses, for example `images/`, `natural-language/`, etc. As
+an example, our
 [dataset-registry](https://github.com/iterative/dataset-registry) uses a
 directory for each of our website documentation sections, such as `get-started/`
 and `use-cases/`.
@@ -75,7 +73,12 @@ and `use-cases/`.
 > [in Get Started](/doc/get-started/add-files), and some Command Reference
 > examples.
 
-### Adding datasets to the registry
+### Adding datasets to a registry
+
+
+
+
+
 
 Imagine a training dataset with 1000 images of cats and dogs that will be used
 to build an ML model. Without DVC, in order for a team to collaborate on this