start: Data Management Trail #2894

Closed · wants to merge 4 commits
10 changes: 2 additions & 8 deletions content/docs/sidebar.json
```diff
@@ -35,15 +35,9 @@
       },
       "children": [
         {
-          "slug": "data-and-model-versioning",
+          "slug": "data-and-model-management",
           "tutorials": {
-            "katacoda": "https://katacoda.com/dvc/courses/get-started/versioning"
-          }
-        },
-        {
-          "slug": "data-and-model-access",
-          "tutorials": {
-            "katacoda": "https://katacoda.com/dvc/courses/get-started/accessing"
+            "katacoda": "https://katacoda.com/dvc/courses/get-started/data-management"
           }
         },
         {
```
272 changes: 272 additions & 0 deletions content/docs/start/data-and-model-management.md
@@ -0,0 +1,272 @@
---
title: Data and Model Management Trail
@jorgeorpinel (Contributor), Oct 8, 2021:

Model Management seems like a very different thing e.g. https://www.dominodatalab.com/solutions/model-management/

I'd say either keep it simple with "Data Management" (well known and understood term) or use another word like "Artifact".

@iesahin (Contributor, Author), Oct 8, 2021:

The current titles include "Model" as well. We consider models as another kind of file. The link you shared adds some more stuff to it, but most of those aspects of model management are covered in dvc exp show or deployment.

Contributor:

But the phrase "model management" is in that title and by itself has a different meaning, which may be confusing for readers and search engines.

---

As its name implies, DVC is used to control versions of data. It lets you keep
track of multiple versions of your datasets.

## Initialize a DVC project
Member:

@iesahin please, let's not for now do any substantial large changes to the existing data management trail. Let's keep the previous project, keep the structure and wrap it up in the section (trail). Maybe remove some parts - like experiments.

It's not the largest priority to rewrite it at the moment to use MNIST or include stuff like remove/gc (which can be even too much for get started to my mind)

I would rather focus on expanding experiments trails with the next steps - metrics, etc. Connecting trails properly, etc.

Contributor (Author):

Ok. That's fine with me.

Before seeing this comment, I began to write #2919 as a replacement for data-pipelines and it updates the underlying project as well. Should I revert it as well?

I thought our initial decision was to create projects suitable for each of these trails.

What's the scope of changes in your mind? @shcheklein @jorgeorpinel

Member:

> I thought our initial decision was to create projects suitable for each of these trails.

yes, and I'm not opposed to this. I would just try to start from some simple steps: wrap the existing project into the trail, move metrics properly to the experiments trail (or keep them here as well, I'm fine with either), and wrap up the experiments trail.

I wish we can try to keep two projects at most - deep learning (experiments, checkpoints, live metrics) and pipelines (nlp / some data processing is a better fit here probably).

Contributor (Author):

> I wish we can try to keep two projects at most - deep learning (experiments, checkpoints, live metrics) and pipelines (nlp / some data processing is a better fit here probably).

I believe we can get away with a single project, mostly. example-dvc-experiments already has a 2-stage pipeline suitable for telling the pipelines story.

Member:

Yep, that would be fine. But let's first keep it as is as much as possible in terms of content/projects. Just rename/move existing sections under the "Data (and Model?) Management Trail", and keep iterating on the experiments for now. It doesn't look that data management lacks any content or needs any immediate rewrite to be honest.


Suppose we are working on a deep learning project to develop the next
groundbreaking supervised learning model. We plan to test the classifier on the
MNIST dataset, but we also plan to use a more difficult one, Fashion-MNIST. We
want to keep track of these two datasets and swap one for the other easily,
without changing the code.

We need a way to track these two datasets as if they were versions of the same
file. DVC is used for exactly this: tracking data and model files as if they
were code files.

#### ✍🏻 We download MNIST data from a URL using wget/curl

We first download the example project from a point where only the source code
files are present.

```dvc
$ git clone https://github.com/iterative/example-data-management -b get-started
$ cd example-data-management
```

The data isn't included with the source code, so we need to download it
separately.

```dvc
$ wget https://dvc.org/datasets/mnist.zip -O data/mnist.zip
```

> Later on, we'll see how to automate this procedure and how DVC can track the
> data along with the code. We're just starting the journey.

## Adding data to DVC projects

We add data and model files (and directories) to DVC with the `dvc add` command.

```dvc
$ dvc add data/mnist.zip
```

DVC stores information about the added file (or directory) in a special `.dvc`
file named `data/mnist.zip.dvc`, a small text file with a human-readable
[format](/doc/user-guide/project-structure/dvc-files). This metadata file is a
placeholder for the original data, and can be easily versioned like source code
with Git:

```dvc
$ git add data/mnist.zip.dvc data/.gitignore
$ git commit -m "Add zipped MNIST data"
```

The original data, meanwhile, is listed in `.gitignore`.
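To make the two artifacts above concrete, here is a plain-shell sketch (not DVC itself; the metafile layout is simplified and the data bytes are dummies) of what `dvc add data/mnist.zip` leaves behind:

```shell
# Plain-shell simulation of the artifacts `dvc add` writes; requires
# GNU coreutils' md5sum. The .dvc layout here is a simplified sketch.
mkdir -p data
printf 'dummy dataset bytes' > data/mnist.zip
# 1. A small metafile pointing at the data by its content hash:
hash=$(md5sum data/mnist.zip | cut -d' ' -f1)
printf 'outs:\n- md5: %s\n  path: mnist.zip\n' "$hash" > data/mnist.zip.dvc
# 2. A .gitignore entry so Git ignores the data file itself:
printf '/mnist.zip\n' > data/.gitignore
cat data/mnist.zip.dvc
```

The real `dvc add` additionally copies the file into `.dvc/cache` and records more metadata in the `.dvc` file.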

## Versioning data in DVC projects

Suppose you have run the experiments with MNIST and would like to see your
model's performance on another dataset. You could update the code to use a
different dataset, but here, to demonstrate how DVC makes it easy to update the
data, we'll overwrite the MNIST file with Fashion-MNIST.

```dvc
$ wget https://dvc.org/datasets/fashion-mnist.zip -O data/mnist.zip
```

Now, when we ask DVC about the changes in the workspace, it tells us that
`mnist.zip` has changed.

```dvc
$ dvc status
```

And we can add the newer version to DVC as well.

```dvc
$ dvc add data/mnist.zip
$ git add data/mnist.zip.dvc
$ git commit -m "Added Fashion MNIST dataset"
```

Now you have two different datasets in your cache, and you can switch between
them as if they are code files in a Git repository.

```dvc
$ git checkout HEAD~1
$ dvc checkout
```

Note that you can also keep these different versions in separate Git branches
or tags. Their content is saved in `.dvc/cache` in the project root, and only a
reference in the form of a `.dvc` file is kept in Git.

Yes, DVC is technically not even a version control system! `.dvc` file contents
define data file versions. Git itself provides the version control. DVC in turn
creates these `.dvc` files, updates them, and synchronizes DVC-tracked data in
the <abbr>workspace</abbr> efficiently to match them.
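As a simplified illustration of how the cache can hold many versions at once, each file is stored under a path derived from its content hash (assuming DVC's default md5-based layout, with the first two hex characters as a subdirectory; this is plain shell, not DVC itself):

```shell
# Sketch of a content-addressable cache: store a file under a path
# derived from its md5 hash. Plain-shell illustration, not DVC itself.
mkdir -p demo/data demo/.dvc/cache
printf 'version-1 of the dataset' > demo/data/file.bin
hash=$(md5sum demo/data/file.bin | cut -d' ' -f1)
dir=$(printf '%s' "$hash" | cut -c1-2)    # first two hash chars -> directory
rest=$(printf '%s' "$hash" | cut -c3-)    # remaining chars -> file name
mkdir -p "demo/.dvc/cache/$dir"
cp demo/data/file.bin "demo/.dvc/cache/$dir/$rest"
find demo/.dvc/cache -type f
```

Because a new version of the file hashes to a different path, both versions can live in the cache side by side, and `dvc checkout` only has to materialize the one the current `.dvc` file points at.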

<details>

### ℹ️ Large datasets versioning

In cases where you process very large datasets, you need an efficient mechanism
(in terms of space and performance) to share a lot of data, including different
versions. Do you use network attached storage (NAS)? Or a large external volume?
You can learn more about advanced workflows using these links:

- A [shared cache](/doc/user-guide/how-to/share-a-dvc-cache) can be set up to
store, version and access a lot of data on a large shared volume efficiently.
- A quite advanced scenario is to track and version data directly on the remote
storage (e.g. S3). See
[Managing External Data](https://dvc.org/doc/user-guide/managing-external-data)
to learn more.

</details>

## Sharing data and models

You can upload DVC-tracked data or model files with `dvc push`, so they're
safely stored [remotely](/doc/command-reference/remote). This also means they
can be retrieved on other environments later with `dvc pull`. First, we need to
set up a remote storage location:

```dvc
$ dvc remote add -d storage s3://mybucket/dvcstore
$ git add .dvc/config
$ git commit -m "Configure remote storage"
```

> DVC supports many remote storage types, including Amazon S3, SSH, Google
> Drive, Azure Blob Storage, and HDFS. See `dvc remote add` for more details and
> examples.

<details>

### ⚙️ Expand to set up remote storage.

DVC remotes let you store a copy of the data tracked by DVC outside of the local
cache (usually a cloud storage service). For simplicity, let's set up a _local
remote_:

```dvc
$ mkdir -p /tmp/dvcstore
$ dvc remote add -d myremote /tmp/dvcstore
$ git commit .dvc/config -m "Configure local remote"
```

> While the term "local remote" may seem contradictory, it doesn't have to be.
> The "local" part refers to the type of location: another directory in the file
> system. "Remote" is what we call storage for <abbr>DVC projects</abbr>. It's
> essentially a local data backup.

</details>

```dvc
$ dvc push
```

Usually, we also want to `git commit` and `git push` the corresponding `.dvc`
files.

## Pushing to/pulling from remotes

To demonstrate how we share the data files with DVC, let's clone the project to
a local directory.

```dvc
$ cd ..
$ git clone example-data-management example-data-management-clone
```

You can see that the clone doesn't contain the data files by checking the size
of both:

```dvc
$ du -hs example-data-management
$ du -hs example-data-management-clone
```

Now, we'll get the data files from the remote we configured earlier with a
single command.

```dvc
$ cd example-data-management-clone
$ dvc pull
```

> Note that `dvc pull` downloads only the files needed in the current
> workspace. To download the files referenced from all commits, use the
> `--all-commits` flag:

```dvc
$ dvc pull --all-commits
```

This is how we share the data and model files attached to a repository. Another
person can simply clone the Git repository and, if they have valid credentials
to access the DVC remote, run `dvc pull` to get the files required to run the
project.
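`dvc pull` is conceptually the reverse of `dvc push`: fetch missing objects from the remote into the local cache, then check them out into the workspace. A rough plain-shell sketch (not DVC itself; paths and bytes are illustrative):

```shell
# Rough sketch of `dvc pull` from a directory remote: fetch objects into
# the local cache, then materialize them in the workspace.
mkdir -p /tmp/pull-remote/ab clone/.dvc/cache clone/data
printf 'pulled dataset bytes' > /tmp/pull-remote/ab/cdef99  # remote object
cp -R /tmp/pull-remote/. clone/.dvc/cache/                  # fetch
cp clone/.dvc/cache/ab/cdef99 clone/data/mnist.zip          # checkout
```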

> 📖 See also
> [Sharing Data and Model Files](/doc/use-cases/sharing-data-and-model-files)
> for more on basic collaboration workflows.

## Accessing public datasets and registries

Earlier, we downloaded the data files with `wget` into a directory in the
repository. DVC also provides an easier way to access data files through the
Git repository they belong to.

For example, instead of downloading the dataset from a web URL, we can use its
reference in a Git repository:

```dvc
$ dvc get https://github.com/iterative/dataset-registry \
          mnist/mnist.zip -o data/mnist.zip
```

If you look at the text file at
https://github.com/iterative/dataset-registry/mnist/mnist.zip.dvc, you'll see
that it's identical to the files `dvc add` produces. Similarly, you can publish
your datasets and models on GitHub by configuring a public DVC remote for them,
so that anyone can access your work via the repository.

Read on or watch our video to see how to find and access models and datasets
with DVC.

https://youtu.be/EE7Gk84OZY8

### Find a file or directory

You can use `dvc list` to explore a <abbr>DVC repository</abbr> hosted on any
Git server. For example, let's see what's in the `mnist/` directory of our
[dataset-registry](https://github.com/iterative/dataset-registry) repo:

```dvc
$ dvc list https://github.com/iterative/dataset-registry mnist
```

The benefit of this command over browsing a Git hosting website is that the list
includes files and directories tracked by both Git and DVC (`mnist.zip` is not
visible if you
[check GitHub](https://github.com/iterative/dataset-registry/tree/master/mnist)).

## Track the data and models automatically

Just as `dvc get` downloads files or directories from a DVC repository,
`dvc import` downloads them too, while also creating a `.dvc` file that keeps
tracking the contents _at their source URL_.

```dvc
$ dvc import https://github.com/iterative/dataset-registry \
             mnist/mnist.zip -o data/mnist.zip
```

This is similar to `dvc get` + `dvc add`, but the resulting `.dvc` file
includes metadata to track changes in the source repository. This allows you to
bring in changes from the data source later using `dvc update`.

`.dvc` files created by `dvc import` have special fields, such as the data
source `repo` and `path`, so when the source changes, DVC can follow the
origin and update the local dataset.
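For illustration, a `.dvc` file written by `dvc import` has roughly this shape (the hash value here is a placeholder, and the exact set of fields varies by DVC version):

```shell
# Illustrative shape of a `dvc import` metafile; the hash is a placeholder.
cat > mnist.zip.dvc <<'EOF'
deps:
- path: mnist/mnist.zip
  repo:
    url: https://github.com/iterative/dataset-registry
outs:
- md5: 00000000000000000000000000000000
  path: mnist.zip
EOF
grep 'url:' mnist.zip.dvc
```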

## Removing data from DVC projects

- Remove files or directories from the workspace with `dvc remove`, which also
  deletes the corresponding `.dvc` files.
- Delete unused files from the cache with `dvc gc`.