use-cases: improvements to data-registry case per Alex' review #805

Closed
wants to merge 8 commits into from
2 changes: 1 addition & 1 deletion static/docs/command-reference/get.md
@@ -163,7 +163,7 @@ different names, and not currently tracked by Git:
$ git status
...
Untracked files:
(use "git add <file>..." to include in what will be committed)
(use "git add <file> ..." to include in what will be committed)

model.bigrams.pkl
model.monograms.pkl
7 changes: 3 additions & 4 deletions static/docs/command-reference/install.md
@@ -155,7 +155,7 @@ checkout the `6-featurization` tag:
$ git checkout 6-featurization
Note: checking out '6-featurization'.

You are in 'detached HEAD' state. ...
You are in 'detached HEAD' state...

$ dvc status

@@ -216,7 +216,7 @@ We can now repeat the command run earlier, to see the difference.
$ git checkout 6-featurization
Note: checking out '6-featurization'.

You are in 'detached HEAD' state. ...
You are in 'detached HEAD' state...

HEAD is now at d13ba9a add featurization stage

@@ -257,8 +257,7 @@ helpfully informs us the workspace is out of sync. We should therefore run the

```dvc
$ dvc repro evaluate.dvc

... much output
...
To track the changes with git run:

git add featurize.dvc train.dvc evaluate.dvc
2 changes: 1 addition & 1 deletion static/docs/tutorials/deep/reproducibility.md
@@ -34,7 +34,7 @@ $ dvc repro model.p.dvc
$ dvc repro
```

Tries to reproduce the same pipeline... But there is still nothing to reproduce.
Tries to reproduce the same pipeline, but there is still nothing to reproduce.

## Adding bigrams

133 changes: 65 additions & 68 deletions static/docs/use-cases/data-registry.md
@@ -7,30 +7,24 @@ tracking of datasets and any other <abbr>data artifacts</abbr>.

With the aim to enable reusability of these versioned artifacts between
different projects (similar to package management systems, but for data), DVC
also includes the `dvc get`, `dvc import`, and `dvc update` commands. For
example, project A may use a data file to begin its data
[pipeline](/doc/command-reference/pipeline), but project B also requires this
same file; Instead of
[adding it](/doc/command-reference/add#example-single-file) it to both projects,
B can simply import it from A. Furthermore, the version of the data file
imported to B can be an older iteration than what's currently used in A.
also includes the `dvc get`, `dvc import`, and `dvc update` commands. This means
that a project can depend on data from an external <abbr>DVC project</abbr>.

Keeping this in mind, we could build a <abbr>DVC project</abbr> dedicated to
tracking and versioning datasets (or any kind of large files). This way we would
have a repository that has all the metadata and change history for the project's
data. We can see who updated what, and when; use pull requests to update data
the same way you do with code; and we don't need ad-hoc conventions to store
different data versions. Other projects can share the data in the registry by
downloading (`dvc get`) or importing (`dvc import`) them for use in different
data processes.
have a repository with all the metadata and history of changes in the project's
data. We could see who updated what, and when, use pull requests to update data
(the same way we do with code), and avoid ad-hoc conventions to store different
data versions. This is what we call a data registry. Other projects can share
datasets in a registry by downloading (`dvc get`) or importing (`dvc import`)
them for use in different data processes.
jorgeorpinel marked this conversation as resolved.

The advantages of using a DVC **data registry** project are:
Advantages of using a DVC **data registry** project:

- Data as code: Improve _lifecycle management_ with versioning of simple
directory structures (like Git for your cloud storage), without ad-hoc
conventions. Leverage Git and Git hosting features such as change history,
branching, pull requests, reviews, and even continuous deployment of ML
models.
conventions. Leverage Git and Git hosting features such as commits, branching,
pull requests, reviews, and even continuous deployment of ML models.
- Reusability: Reproduce and organize _feature stores_ with a simple CLI
(`dvc get` and `dvc import` commands, similar to software package management
systems like `pip`).
@@ -49,29 +43,30 @@ The advantages of using a DVC **data registry** project are:

## Example

A dataset we use for several of our examples and tutorials is one containing
2800 images of cats and dogs. We partitioned the dataset in two for our
[Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on a
storage server, downloading them with `wget` in our examples. This setup was
then revised to download the dataset with `dvc get` instead, so we created the
[dataset-registry](https://github.com/iterative/dataset-registry)) repository, a
<abbr>DVC project</abbr> hosted on GitHub, to version the dataset (see its
A dataset we commonly use for several of our examples and tutorials contains
2800 images of cats and dogs, which was split in two for our
[Versioning Tutorial](/doc/tutorials/versioning). Originally, the parts were
backed up on a storage server, and downloaded with
[`wget`](https://www.gnu.org/software/wget/). This was then revised in order to
download the parts with `dvc get` instead, so we created the
[dataset-registry](https://github.com/iterative/dataset-registry)
<abbr>project</abbr> to version the dataset (in the
[`tutorial/ver`](https://github.com/iterative/dataset-registry/tree/master/tutorial/ver)
directory).

However, there are a few problems with the way this dataset is structured. Most
importantly, this single dataset is tracked by 2 different
[DVC-files](/doc/user-guide/dvc-file-format), instead of 2 versions of the same
one, which would better reflect the intentions of this dataset... Fortunately,
we have also prepared an improved alternative in the
However, there are a few problems with the way that dataset is versioned. Most
importantly, this split dataset is tracked by 2 different
[DVC-files](/doc/user-guide/dvc-file-format) (one for each part), instead of 2
versions of a single DVC-file. An initial version could have the first part
only, while an update would have the entire, unified dataset. Fortunately, we
have also prepared this improved alternative in the
[`use-cases/`](https://github.com/iterative/dataset-registry/tree/master/use-cases)
directory of the same <abbr>DVC repository</abbr>.

To create a
[first version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases)
To create the
[initial version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases)
of our dataset, we extracted the first part into the `use-cases/cats-dogs`
directory (illustrated below), and ran `dvc add use-cases/cats-dogs` to
[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory).
directory, illustrated below:

```dvc
$ tree use-cases/cats-dogs --filelimit 3
@@ -85,7 +80,10 @@ use-cases/cats-dogs
└── dogs [400 image files]
```

In a local DVC project, we could have obtained this dataset at this point with
Then we ran `dvc add use-cases/cats-dogs` to
[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory).

At this point, we could have obtained this dataset in another DVC project with
the following command:

```dvc
@@ -95,15 +93,16 @@ $ dvc import git@github.com:iterative/dataset-registry.git \

> Note that unlike `dvc get`, which can be used from any directory, `dvc import`
> always needs to run from an [initialized](/doc/command-reference/init) DVC
> project.
> project. Remember also that with both commands, the data comes from the source
> project's remote storage, not from the Git repository itself.

<details>

### Expand for actionable command (optional)

The command above is meant for informational purposes only. If you actually run
it in a DVC project, although it should work, it will import the latest version
of `use-cases/cats-dogs` from `dataset-registry`. The following command would
it, although it will work, it will import the latest version of
`use-cases/cats-dogs` from `dataset-registry`. The following command would
actually bring in the version in question:

```dvc
@@ -117,54 +116,52 @@ See the `dvc import` command reference for more details on the `--rev`

</details>

Importing keeps the connection between the local project and the source data
registry where we are downloading the dataset from. This is achieved by creating
a particular kind of [DVC-file](/doc/user-guide/dvc-file-format) that uses the
`repo` field (a.k.a. _import stage_). (This file can be used for versioning the
import with Git.)
Importing keeps the connection between the local <abbr>project</abbr> and the
data source (registry <abbr>repository</abbr>). This is achieved by creating a
particular kind of [DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import
stage_) that includes a `repo` field. (This file can be staged and committed
with Git.)

> For a sample DVC-file resulting from `dvc import`, refer to
> [this example](/doc/command-reference/import#example-data-registry).

Back in our **dataset-registry** project, a
Back in our **dataset-registry** project, the
[second version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases)
of our dataset was created by extracting the second part, with 1000 additional
images (500 cats, 500 dogs), into the same directory structure. Then, we simply
ran `dvc add use-cases/cats-dogs` again.
images (500 cats, 500 dogs) on top of the existing directory structure. Then, we
simply ran `dvc add use-cases/cats-dogs` again.

In our local project, all we have to do in order to obtain this latest version
of the dataset is to run:
All we would have to do in order to obtain this latest version in another
project where the first version was previously imported, is to run:

```dvc
$ dvc update cats-dogs.dvc
```

This is possible because of the connection that the import stage saved among
local and source projects, as explained earlier.

<details>

### Expand for actionable command (optional)

As with the previous hidden note, actually trying the commands above should
produced the expected results, but not for obvious reasons. Specifically, the
initial `dvc import` command would have already obtained the latest version of
the dataset (as noted before), so this `dvc update` is unnecessary and won't
have an effect.
As with the previous hidden note, actually trying the command above will produce
the desired results, but not for obvious reasons. The initial `dvc import`
command would have already obtained the latest version of the dataset (as noted
before), so this `dvc update` is unnecessary and won't have any effect.

If you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its import
stage (DVC-file) would be fixed to that Git tag (`cats-dogs-v1`). In order to
update it, do not use `dvc update`. Instead, re-import the data by using the
original import command (without `--rev`). Refer to
[this example](http://localhost:3000/doc/command-reference/import#example-fixed-revisions-re-importing)
for more information.
And if you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its
import stage (DVC-file) would be
[fixed to that revision](/doc/command-reference/import#example-fixed-revisions-re-importing)
(`cats-dogs-v1` tag), so `dvc update` would also be ineffective. In order to
actually "update" it, re-import the data instead, by now running the initial
import command (the one without `--rev`):

</details>
```dvc
$ dvc import git@github.com:iterative/dataset-registry.git \
use-cases/cats-dogs
```

This downloads new and changed files in `cats-dogs/` from the source project,
and updates the metadata in the import stage DVC-file.
</details>

As an extra detail, notice that so far our local project is working only with a
local <abbr>cache</abbr>. It has no need to setup a
[remotes](/doc/command-reference/remote) to [pull](/doc/command-reference/pull)
or [push](/doc/command-reference/push) this dataset.
This is possible because of the connection that the import stage saved among
local and source projects, as explained earlier. The update downloads new and
changed files in `cats-dogs/` based on the source project, and updates the
metadata in the import stage DVC-file.
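
The fixed-revision caveat discussed above can be illustrated with a small shell sketch. This is not part of DVC; the function names `import_is_pinned` and `suggest_refresh` are hypothetical. It assumes that an import stage created with `--rev` records a `rev:` key under its `repo` field (next to a `rev_lock:` key that is present either way), as described in the `dvc import` reference; adjust the pattern if your DVC version writes the file differently.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the given import stage DVC-file pins a
# revision. Assumption: a pinned import stores "rev: <tag>" under "repo:".
# The "rev_lock:" key does not match, because "rev" must be followed by ":".
import_is_pinned() {
  grep -Eq '^[[:space:]]*rev:[[:space:]]' "$1"
}

# Print the refresh action this use case suggests for an import stage:
# a pinned import is re-imported without --rev, otherwise dvc update is used.
suggest_refresh() {
  if import_is_pinned "$1"; then
    echo "re-import without --rev"
  else
    echo "dvc update $1"
  fi
}
```

For example, `suggest_refresh cats-dogs.dvc` would print the `dvc update` command for an import made without `--rev`, and the re-import hint for one fixed to the `cats-dogs-v1` tag. The sketch only reads the file; it never runs DVC itself.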