
use-cases: improvements to data-registry case per Alex' review #805

Closed
wants to merge 8 commits
33 changes: 14 additions & 19 deletions static/docs/use-cases/data-registry.md
tracking of datasets and any other <abbr>data artifacts</abbr>.

To enable reuse of these versioned artifacts between different projects
(similar to package management systems, but for data), DVC also includes the
`dvc get`, `dvc import`, and `dvc update` commands. This means that a project
can depend on data from an external <abbr>DVC project</abbr>, but chaining
several projects this way can easily become messy.
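
To make this concrete, here is a minimal sketch of that flow; the repository
URL and data path below are hypothetical:

```
# Download a one-time copy of a dataset from another DVC project
$ dvc get https://github.com/example/project-a data/images

# Or import it, which downloads the data and also records the
# dependency on the source project in an images.dvc file
$ dvc import https://github.com/example/project-a data/images

# Later, bring the imported data back in sync with its source
$ dvc update images.dvc
```

Unlike `dvc get`, `dvc import` leaves a `.dvc` file behind, so the link to the
source project is versioned along with the rest of the repository.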

Keeping this in mind, we could build a <abbr>DVC project</abbr> dedicated to
tracking and versioning datasets (or any kind of large files). This way we would
have a repository with all the metadata and history of changes in the
project's data. We could see who updated what and when, use pull requests to
update data (the same way we do with code), and avoid ad-hoc conventions for
storing different data versions. This is what we call a data registry. Other
projects can share datasets in a registry by downloading (`dvc get`) or
importing (`dvc import`) them for use in different data processes.
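
As a rough sketch of how such a registry could be set up (the repository and
dataset names here are made up for illustration):

```
# A regular Git repository serves as the registry
$ git init dataset-registry && cd dataset-registry
$ dvc init

# Track a dataset with DVC; this produces a small .dvc metafile
$ dvc add datasets/cats-dogs
$ git add datasets/cats-dogs.dvc datasets/.gitignore
$ git commit -m "Add raw cats-and-dogs dataset"

# Push the actual data to remote storage (a DVC remote
# is assumed to be configured already)
$ dvc push
```

Only the `.dvc` metafiles and Git history live in the registry repository; the
data itself sits in remote storage, where consuming projects fetch it from.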

Advantages of using a DVC **data registry** project:

- Data as code: Improve _lifecycle management_ with versioning of simple
directory structures (like Git for your cloud storage), without ad-hoc
conventions. Leverage Git and Git hosting features such as commits, branching,
pull requests, reviews, and even continuous deployment of ML models.
- Reusability: Reproduce and organize _feature stores_ with a simple CLI
(`dvc get` and `dvc import` commands, similar to software package management
systems like `pip`).

## Example

A dataset we use for several of our examples and tutorials contains 2800 images
of cats and dogs. We split the dataset in two for our
[Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on a
storage server, downloading them with `wget` in our examples. This setup was
then revised to download the dataset with `dvc get` instead, so we created the