start: Data Management Trail #2894
---
title: Data and Model Management Trail
---

As its name implies, DVC is used to control versions of data. It enables you to
keep track of multiple versions of your datasets.

## Initialize a DVC project

> 💬 **Review thread**
>
> @iesahin please, let's not do any substantial large changes to the existing
> data management trail for now. Let's keep the previous project, keep the
> structure, and wrap it up in the section (trail). Maybe remove some parts,
> like experiments. It's not the largest priority at the moment to rewrite it
> to use MNIST or include stuff like remove/gc (which can even be too much for
> Get Started, to my mind). I would rather focus on expanding the experiments
> trails with the next steps (metrics, etc.), connecting the trails properly,
> etc.
>
> Ok, that's fine with me. Before seeing this comment, I began to write #2919
> as a replacement. I thought our initial decision was to create projects
> suitable for each of these trails. What's the scope of changes in your mind?
> @shcheklein @jorgeorpinel
>
> Yes, and I'm not opposed to this. I would just try to go from some simple
> steps: wrap up the existing project into the trail, move metrics properly to
> the experiments (or keep them here as well, I'm fine with either), and wrap
> up the experiments trail. I wish we could keep two projects at most: deep
> learning (experiments, checkpoints, live metrics) and pipelines (NLP / some
> data processing is probably a better fit there).
>
> I believe we can get away with a single project, mostly.
>
> Yep, that would be fine. But let's first keep it as is as much as possible
> in terms of content/projects. Just rename/move the existing sections under
> the "Data (and Model?) Management Trail", and keep iterating on the
> experiments for now. It doesn't look like data management lacks any content
> or needs any immediate rewrite, to be honest.

Suppose we are working on a deep learning project to develop the next
groundbreaking supervised learning model. We plan to test the classifier on the
MNIST dataset, but we also plan to use a more difficult one, Fashion-MNIST. We
want to keep track of these two datasets and switch between them easily,
without changing the code.

We need a way to track these two datasets as if they were versions of the same
text file. DVC is used to track data and model files just like code files.

#### ✍🏻 We download MNIST data from a URL using wget/curl

We first download the example project at a point where only the source code
files are present:

```dvc
$ git clone https://github.com/iterative/example-data-management -b get-started
$ cd example-data-management
```

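The example project is already a Git repository. If you are instead starting
from a project of your own, DVC is typically initialized with `dvc init`. A
minimal sketch, assuming DVC is already installed:

```dvc
$ git init       # skip if the directory is already a Git repository
$ dvc init       # creates the .dvc/ directory with DVC's internal files
$ git commit -m "Initialize DVC"    # dvc init stages the files it creates
```
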
The data is not provided with the source code, so we need to download it
separately:

```dvc
$ wget https://dvc.org/datasets/mnist.zip -O data/mnist.zip
```

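If `wget` is not available, `curl` does the same job. An equivalent sketch
(`-L` follows redirects, `-o` names the output file):

```dvc
$ curl -L https://dvc.org/datasets/mnist.zip -o data/mnist.zip
```
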
> Later on, we'll see how to automate this procedure and how DVC can track the
> data along with the code. We're just starting the journey.

## Adding data to DVC projects

We add data and model files (and directories) to DVC with the `dvc add`
command:

```dvc
$ dvc add data/mnist.zip
```

DVC stores information about the added file (or directory) in a special `.dvc`
file named `data/mnist.zip.dvc`, a small text file with a human-readable
[format](/doc/user-guide/project-structure/dvc-files). This metadata file is a
placeholder for the original data, and it can be versioned like source code
with Git:

```dvc
$ git add data/mnist.zip.dvc data/.gitignore
$ git commit -m "Add zipped MNIST data"
```

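If you are curious, the placeholder is plain YAML. Its contents look roughly
like this (the checksum and size below are placeholders, and the exact fields
can vary between DVC versions):

```dvc
$ cat data/mnist.zip.dvc
outs:
- md5: <checksum of mnist.zip>
  size: <size in bytes>
  path: mnist.zip
```
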
The original data, meanwhile, is listed in `data/.gitignore` (which `dvc add`
updates for us), so Git itself never tracks the large file.

## Versioning data in DVC projects

Suppose you have run the [experiments] with MNIST and would like to see how
your model performs on another dataset. You could update the code to use a
different dataset, but here, to demonstrate how DVC makes it easy to update the
data, we'll simply overwrite the MNIST file with Fashion-MNIST:

```dvc
$ wget https://dvc.org/datasets/fashion-mnist.zip -O data/mnist.zip
```

Now, when we ask DVC about the changes in the workspace, it tells us that
`mnist.zip` has changed:

```dvc
$ dvc status
```

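The output should look roughly like the following (the exact formatting varies
between DVC versions):

```dvc
data/mnist.zip.dvc:
        changed outs:
                modified:           data/mnist.zip
```
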
And we can add the newer version to DVC as well:

```dvc
$ dvc add data/mnist.zip
$ git add data/mnist.zip.dvc
$ git commit -m "Added Fashion MNIST dataset"
```

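If you want to see what actually changed on the Git side, diff the placeholder
file across the last two commits; only the recorded checksum (and size) of the
data differs:

```dvc
$ git diff HEAD~1 -- data/mnist.zip.dvc
```
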
Now you have two different datasets in your cache, and you can switch between
them as if they were code files in a Git repository:

```dvc
$ git checkout HEAD~1
$ dvc checkout
```

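To return to the Fashion-MNIST version afterwards, jump back and sync the data
again (a small sketch; `git checkout -` returns to the previously checked-out
position):

```dvc
$ git checkout -
$ dvc checkout
```
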
Note that you can also keep these different versions in separate Git branches
or tags. Their content is saved in `.dvc/cache` in the project root, and only a
reference in the form of a `.dvc` file is kept in Git.

Yes, DVC is technically not even a version control system! The `.dvc` file
contents define data file versions. Git itself provides the version control.
DVC in turn creates these `.dvc` files, updates them, and synchronizes
DVC-tracked data in the <abbr>workspace</abbr> efficiently to match them.

<details>

### ℹ️ Large datasets versioning

In cases where you process very large datasets, you need an efficient mechanism
(in terms of space and performance) to share a lot of data, including different
versions. Do you use network attached storage (NAS)? Or a large external
volume? You can learn more about advanced workflows using these links:

- A [shared cache](/doc/user-guide/how-to/share-a-dvc-cache) can be set up to
  store, version and access a lot of data on a large shared volume efficiently.
- A more advanced scenario is to track and version data directly on the remote
  storage (e.g. S3). See
  [Managing External Data](https://dvc.org/doc/user-guide/managing-external-data)
  to learn more.

</details>

## Sharing data and models

You can upload DVC-tracked data or model files with `dvc push`, so they're
safely stored [remotely](/doc/command-reference/remote). This also means they
can be retrieved on other environments later with `dvc pull`. First, we need to
set up a remote storage location:

```dvc
$ dvc remote add -d storage s3://mybucket/dvcstore
$ git add .dvc/config
$ git commit -m "Configure remote storage"
```

> DVC supports many remote storage types, including Amazon S3, SSH, Google
> Drive, Azure Blob Storage, and HDFS. See `dvc remote add` for more details
> and examples.

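Remotes on other storage types are configured the same way. For example (the
bucket, host, and path below are placeholders):

```dvc
# Google Cloud Storage
$ dvc remote add -d storage gs://mybucket/dvcstore
# or plain SSH
$ dvc remote add -d storage ssh://user@example.com/path/to/dvcstore
```
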
<details>

### ⚙️ Expand to set up remote storage

DVC remotes let you store a copy of the data tracked by DVC outside of the
local cache (usually a cloud storage service). For simplicity, let's set up a
_local remote_:

```dvc
$ mkdir -p /tmp/dvcstore
$ dvc remote add -d myremote /tmp/dvcstore
$ git commit .dvc/config -m "Configure local remote"
```

> While the term "local remote" may seem contradictory, it doesn't have to be.
> The "local" part refers to the type of location: another directory in the
> file system. "Remote" is what we call storage for <abbr>DVC projects</abbr>.
> It's essentially a local data backup.

</details>

```dvc
$ dvc push
```

Usually, we also want to `git commit` and `git push` the corresponding `.dvc`
files.

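In this walkthrough the `.dvc` files were committed along the way, so sharing
the Git side is as simple as this (assuming a Git remote such as `origin` is
already configured):

```dvc
$ git push
```
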
## Pushing to/pulling from remotes

To demonstrate how we share the data files with DVC, let's clone the project
into a second directory:

```dvc
$ cd ..
$ git clone example-data-management example-data-management-clone
```

You can see that the clone doesn't contain the data files by checking the size
of both:

```dvc
$ du -hs example-data-management
$ du -hs example-data-management-clone
```

Now, we'll get the data files from the remote we configured earlier with a
single command:

```dvc
$ cd example-data-management-clone
$ dvc pull
```

> Note that `dvc pull` downloads only the files needed in the current
> workspace. To obtain the cached files for all commits, use the
> `--all-commits` flag:

```dvc
$ dvc pull --all-commits
```

This is how we share data and model files attached to the repository. Another
person can simply clone the Git repository and (if they have valid credentials
to access the DVC remote) `dvc pull` the files required to run the project.

> 📖 See also
> [Sharing Data and Model Files](/doc/use-cases/sharing-data-and-model-files)
> for more on basic collaboration workflows.

## Accessing public datasets and registries

Earlier, we downloaded the data with `wget` into a directory in the repository.
DVC also provides an easier way to access data files through the Git repository
they belong to.

For example, instead of downloading the dataset from a web URL, we can use its
reference in a Git repository:

```dvc
$ dvc get https://github.com/iterative/dataset-registry \
          mnist/mnist.zip -o data/mnist.zip
```

If you look at the text file at
https://github.com/iterative/dataset-registry/mnist/mnist.zip.dvc, you'll see
that it's identical to the files `dvc add` produces. Similarly, you can publish
your datasets and models on GitHub by configuring a public DVC remote for them,
so that anyone can access your work via the repository.

Read on or watch our video to see how to find and access models and datasets
with DVC.

https://youtu.be/EE7Gk84OZY8

### Find a file or directory

You can use `dvc list` to explore a <abbr>DVC repository</abbr> hosted on any
Git server. For example, let's see what's in the `mnist/` directory of our
[dataset-registry](https://github.com/iterative/dataset-registry) repo:

```dvc
$ dvc list https://github.com/iterative/dataset-registry \
           mnist/
```

The benefit of this command over browsing a Git hosting website is that the
list includes files and directories tracked by both Git and DVC (`mnist.zip` is
not visible if you
[check GitHub](https://github.com/iterative/dataset-registry/tree/master/mnist)).

## Track the data and models automatically

Just as `dvc get` can download contents from a DVC repository, `dvc import` can
download any file or directory too, while also creating a `.dvc` file that
tracks where the contents came _from_:

```dvc
$ dvc import https://github.com/iterative/dataset-registry \
             mnist/mnist.zip -o data/mnist.zip
```

This is similar to `dvc get` + `dvc add`, but the resulting `.dvc` file also
includes metadata to track changes in the source repository. This allows you to
bring in changes from the data source later using `dvc update`.

`.dvc` files created by `dvc import` have special fields, such as the data
source `repo` and `path`. When the source changes, DVC can therefore follow the
origin and update the local copy of the dataset.

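For illustration, an imported `.dvc` file has roughly this shape (checksums and
the locked revision are placeholders; the exact fields depend on the DVC
version):

```dvc
$ cat data/mnist.zip.dvc
deps:
- path: mnist/mnist.zip
  repo:
    url: https://github.com/iterative/dataset-registry
    rev_lock: <commit the data was imported from>
outs:
- md5: <checksum of mnist.zip>
  path: mnist.zip
```

Running `dvc update data/mnist.zip.dvc` later re-checks the source repository
and brings in a newer version of the data if one exists.
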
## Removing data from DVC projects

- Remove certain folders from the workspace
- Delete the corresponding cache files (see the sketch below)

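A minimal sketch of the commands this typically involves (the target below is
just an example; see `dvc remove` and `dvc gc` for details):

```dvc
$ dvc remove data/mnist.zip.dvc   # stop tracking the file; removes the .dvc file
$ dvc gc --workspace              # delete cached data not referenced in the workspace
```
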
> 💬 **Review thread**
>
> "Model Management" seems like a very different thing, e.g.
> https://www.dominodatalab.com/solutions/model-management/. I'd say either
> keep it simple with "Data Management" (a well-known and understood term) or
> use another word like "Artifact".
>
> The current titles include "Model" as well. We consider models as just
> another kind of file. The link you shared adds some more aspects, but most
> of those parts of model management are covered in `dvc exp show` or
> deployment.
>
> But the phrase "model management" is in that title and by itself has a
> different meaning, which may be confusing for readers and search engines.