diff --git a/src/Documentation/sidebar.json b/src/Documentation/sidebar.json index 9dd847c917..c92333fbb3 100644 --- a/src/Documentation/sidebar.json +++ b/src/Documentation/sidebar.json @@ -109,7 +109,7 @@ "slug": "sharing-data-and-model-files" }, "shared-development-server", - "data-registry" + "data-registries" ] }, { diff --git a/static/docs/command-reference/get.md b/static/docs/command-reference/get.md index f0e37cdd64..ca2089f3b2 100644 --- a/static/docs/command-reference/get.md +++ b/static/docs/command-reference/get.md @@ -175,7 +175,7 @@ different names, and not currently tracked by Git: $ git status ... Untracked files: - (use "git add ..." to include in what will be committed) + (use "git add ..." to include in what will be committed) model.bigrams.pkl model.monograms.pkl diff --git a/static/docs/command-reference/install.md b/static/docs/command-reference/install.md index cda7101d8b..ff2c9710a2 100644 --- a/static/docs/command-reference/install.md +++ b/static/docs/command-reference/install.md @@ -155,7 +155,7 @@ checkout the `6-featurization` tag: $ git checkout 6-featurization Note: checking out '6-featurization'. -You are in 'detached HEAD' state. ... +You are in 'detached HEAD' state... $ dvc status @@ -216,7 +216,7 @@ We can now repeat the command run earlier, to see the difference. $ git checkout 6-featurization Note: checking out '6-featurization'. -You are in 'detached HEAD' state. ... +You are in 'detached HEAD' state... HEAD is now at d13ba9a add featurization stage @@ -257,8 +257,7 @@ helpfully informs us the workspace is out of sync. We should therefore run the ```dvc $ dvc repro evaluate.dvc - -... much output +... 
To track the changes with git run:
git add featurize.dvc train.dvc evaluate.dvc
diff --git a/static/docs/tutorials/deep/reproducibility.md b/static/docs/tutorials/deep/reproducibility.md
index 1e3ad9fcb3..25d1e7024f 100644
--- a/static/docs/tutorials/deep/reproducibility.md
+++ b/static/docs/tutorials/deep/reproducibility.md
@@ -34,7 +34,7 @@ $ dvc repro model.p.dvc
$ dvc repro
```
-Tries to reproduce the same pipeline... But there is still nothing to reproduce.
+Tries to reproduce the same pipeline, but there is still nothing to reproduce.
## Adding bigrams
diff --git a/static/docs/use-cases/data-registries.md b/static/docs/use-cases/data-registries.md
new file mode 100644
index 0000000000..a061a18e9b
--- /dev/null
+++ b/static/docs/use-cases/data-registries.md
@@ -0,0 +1,210 @@
+# Data Registries
+
+One of the main uses of DVC repositories is the
+[versioning of data and model files](/doc/use-cases/data-and-model-files-versioning),
+with commands such as `dvc add`. With the aim of enabling the reuse of these
+data artifacts between different projects, DVC also provides the
+`dvc import` and `dvc get` commands, among others. This means that a project can
+depend on data from an external DVC project, **similar to package
+management systems, but for data science projects**.
+
+![](/static/img/data-registry.png) _Data and models as code_
+
+Keeping this in mind, we could build a DVC project dedicated to
+tracking and versioning _datasets_ (or any large data, even ML models). This way
+we would have a repository with all the metadata and history of changes of
+different datasets. We could see who updated what, and when, and use pull
+requests to update data (the same way we do with code). This is what we call a
+**data registry**, which can work as data management _middleware_ between ML
+projects and cloud storage.
+
+> Note that a single dedicated repository is just one possible pattern to create
+> data registries with DVC.
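Such a dedicated registry repository can be bootstrapped like any other DVC project. A minimal sketch (the repository name `dataset-registry` here stands in for whatever the registry is called):

```dvc
$ git init dataset-registry
$ cd dataset-registry
$ dvc init
$ git commit -m "Initialize DVC data registry"
```

From here on, datasets are added and versioned as shown in the sections below.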
+
+Advantages of using a DVC **data registry** project:
+
+- Data as code: Improve _lifecycle management_ with versioning of simple
+  directory structures (like Git on cloud storage), without ad-hoc conventions.
+  Leverage Git and Git hosting features such as commits, branching, pull
+  requests, reviews, and even continuous deployment of ML models.
+- Reusability: Reproduce and organize _feature stores_ with a simple CLI
+  (`dvc get` and `dvc import` commands, similar to software package management
+  systems like `pip`).
+- Persistence: The DVC registry-controlled
+  [remote storage](/doc/command-reference/remote) (e.g. an S3 bucket) improves
+  data security. There is less chance that someone will delete or rewrite a
+  model, for example.
+- Storage Optimization: Track data
+  [shared](/doc/use-cases/share-data-and-model-files) by multiple projects
+  centralized in a single location (with the ability to create distributed
+  copies on other remotes). This simplifies data management and optimizes space
+  requirements.
+- Security: Registries can be set up to have read-only remote storage (e.g. an
+  HTTP location). Git versioning of [DVC-files](/doc/user-guide/dvc-file-format)
+  allows us to track and audit data changes.
+
+## Building registries
+
+Data registries can be created like any other DVC repository with
+`git init` and `dvc init`. A good way to organize them is with different
+directories that group the data by use, such as `images/`,
+`natural-language/`, etc. For example, our
+[dataset-registry](https://github.com/iterative/dataset-registry) uses a
+directory for each section in our website documentation, like `get-started/`,
+`use-cases/`, etc.
+
+Adding datasets to a registry can be as simple as placing the data file or
+directory in question inside the workspace, and telling DVC to
+track it with `dvc add`.
For example:
+
+```dvc
+$ mkdir -p music
+$ cp -R ~/Downloads/millionsongsubset_full music/songs
+$ dvc add music/songs
+```
+
+> This example dataset actually exists. See
+> [MillionSongSubset](http://millionsongdataset.com/pages/getting-dataset/#subset).
+
+A regular Git workflow can be followed with the tiny
+[DVC-files](/doc/user-guide/dvc-file-format) that substitute for the actual data
+(`music/songs.dvc` in this example). This enables team collaboration on data at
+the same level as with source code (commit history, branching, pull requests,
+reviews, etc.):
+
+```dvc
+$ git add music/songs.dvc music/.gitignore
+$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
+```
+
+The actual data is stored in the project's cache and should be
+[pushed](/doc/command-reference/push) to one or more
+[remote storage](/doc/command-reference/remote) locations, so the registry can
+be accessed from other locations or by other people:
+
+```dvc
+$ dvc remote add -d myremote s3://bucket/path
+$ dvc push
+```
+
+## Using registries
+
+The main methods to consume data artifacts from a **data registry**
+are the `dvc import` and `dvc get` commands, as well as the `dvc.api` Python
+API.
+
+### Simple download (get)
+
+This is analogous to using direct download tools like
+[`wget`](https://www.gnu.org/software/wget/) (HTTP),
+[`aws s3 cp`](https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html) (S3),
+etc. To get a dataset, for example, we can run something like:
+
+```dvc
+$ dvc get https://github.com/example/registry \
+          music/songs/
+```
+
+This downloads `music/songs/` from the project's
+[default remote](/doc/command-reference/remote/default) and places it in the
+current working directory (it can be run anywhere in the file system where the
+user has write access).
+
+> Note that this command (as well as `dvc import`) has a `--rev` option to
+> download specific versions of the data.
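For instance, a specific version of the dataset could be downloaded by passing any Git revision of the registry repository to `--rev` (the tag name `v1.0` below is hypothetical):

```dvc
$ dvc get --rev v1.0 \
          https://github.com/example/registry \
          music/songs/
```

This makes it easy to pin a consumer project to an older iteration of the data than what the registry currently tracks.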
+
+### Import workflow
+
+`dvc import` uses the same syntax as `dvc get`:
+
+```dvc
+$ dvc import https://github.com/example/registry \
+             images/faces/
+```
+
+> Note that unlike `dvc get`, which can be used from any directory, `dvc import`
+> needs to run within an [initialized](/doc/command-reference/init) DVC project.
+
+Besides downloading, importing saves the local project's dependency on the data
+source (registry repository). This is achieved by creating a particular
+kind of [DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import stage_).
+This file can be staged and committed with Git.
+
+Thanks to this saved dependency, we can easily bring the data up to date in our
+consumer project with `dvc update` whenever the dataset changes in the source
+project (data registry):
+
+```dvc
+$ dvc update faces.dvc
+```
+
+`dvc update` downloads new and changed files, or removes deleted ones, from
+`images/faces/`, based on the latest version of the source project. It also
+updates the project dependency metadata in the import stage (DVC-file).
+
+### Programmatic reusability of DVC data
+
+Our Python API, included with the `dvc` package, provides the
+`open` function to load/stream data directly from external DVC projects:
+
+```python
+import pickle
+
+import dvc.api
+
+model_path = 'model.pkl'
+repo_url = 'https://github.com/example/registry'
+
+with dvc.api.open(model_path, repo=repo_url) as fd:
+    model = pickle.load(fd)
+    # ... Use the model!
+```
+
+This opens `model.pkl` as a file descriptor. The example above tries to
+illustrate a hardcoded ML model **deployment** method.
+
+## Updating registries
+
+Datasets evolve, and DVC is prepared to handle it.
Just change the data in the
+registry, and apply the updates by running `dvc add` again:
+
+```dvc
+$ cp -R /path/to/1000/more/songs music/songs
+$ dvc add music/songs
+```
+
+DVC then modifies the corresponding DVC-file to reflect the changes in the data,
+and this will be noticed by Git:
+
+```dvc
+$ git status
+Changes not staged for commit:
+...
+	modified: music/songs.dvc
+$ git commit -am "Add 1,000 more songs to music/ dataset."
+```
+
+Iterating on this process for several datasets can give shape to robust
+registries, which are basically repositories that mainly version a bunch of
+DVC-files, as you can see in the hypothetical example below.
+
+```dvc
+$ tree --filelimit=100
+.
+├── images
+│   ├── .gitignore
+│   ├── cats-dogs [2800 entries]  # Listed in .gitignore
+│   ├── faces [10000 entries]     # Listed in .gitignore
+│   ├── cats-dogs.dvc
+│   └── faces.dvc
+├── music
+│   ├── .gitignore
+│   ├── songs [11000 entries]     # Listed in .gitignore
+│   └── songs.dvc
+├── text
+...
+```
+
+And let's not forget to `dvc push` data changes to the
+[remote storage](/doc/command-reference/remote), so others can obtain them!
+
+```dvc
+$ dvc push
+```
diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md
deleted file mode 100644
index 0a5a8281c6..0000000000
--- a/static/docs/use-cases/data-registry.md
+++ /dev/null
@@ -1,171 +0,0 @@
-# Data Registry
-
-One of the main uses of DVC repositories is the
-[versioning of data and model files](/doc/use-cases/data-and-model-files-versioning).
-This is provided by commands such as `dvc add` and `dvc run`, that allow
-tracking of datasets and any other data artifacts.
-
-With the aim to enable reusability of these versioned artifacts between
-different projects (similar to package management systems, but for data), DVC
-also includes the `dvc get`, `dvc import`, and `dvc update` commands.
For -example, project A may use a data file to begin its data -[pipeline](/doc/command-reference/pipeline), but project B also requires this -same file; Instead of -[adding it](/doc/command-reference/add#example-single-file) it to both projects, -B can simply import it from A. Furthermore, the version of the data file -imported to B can be an older iteration than what's currently used in A. - -Keeping this in mind, we could build a DVC project dedicated to -tracking and versioning datasets (or any kind of large files). This way we would -have a repository that has all the metadata and change history for the project's -data. We can see who updated what, and when; use pull requests to update data -the same way you do with code; and we don't need ad-hoc conventions to store -different data versions. Other projects can share the data in the registry by -downloading (`dvc get`) or importing (`dvc import`) them for use in different -data processes. - -The advantages of using a DVC **data registry** project are: - -- Reusability: Reproduce and organize _feature stores_ with a simple CLI - (`dvc get` and `dvc import` commands, similar to software package management - systems like `pip`). -- Persistence: The DVC registry-controlled - [remote storage](/doc/command-reference/remote) (e.g. an S3 bucket) improves - data security. There are less chances someone can delete or rewrite a model, - for example. -- Storage Optimization: Track data - [shared](/doc/use-cases/share-data-and-model-files) by multiple projects - centralized in a single location (with the ability to create distributed - copies on other remotes). This simplifies data management and optimizes space - requirements. -- Security: Registries can be setup to have read-only remote storage (e.g. an - HTTP location). Git versioning of DVC-files allows us to track and audit data - changes. 
-- Data as code: Improve _lifecycle management_ with versioning of simple - directory structures (like Git for your cloud storage), without ad-hoc - conventions. Leverage Git and Git hosting features such as change history, - branching, pull requests, reviews, and even continuous deployment of ML - models. - - -## Example - -A dataset we use for several of our examples and tutorials is one containing -2800 images of cats and dogs. We partitioned the dataset in two for our -[Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on a -storage server, downloading them with `wget` in our examples. This setup was -then revised to download the dataset with `dvc get` instead, so we created the -[dataset-registry](https://github.com/iterative/dataset-registry)) repository, a -DVC project hosted on GitHub, to version the dataset (see its -[`tutorial/ver`](https://github.com/iterative/dataset-registry/tree/master/tutorial/ver) -directory). - -However, there are a few problems with the way this dataset is structured. Most -importantly, this single dataset is tracked by 2 different -[DVC-files](/doc/user-guide/dvc-file-format), instead of 2 versions of the same -one, which would better reflect the intentions of this dataset... Fortunately, -we have also prepared an improved alternative in the -[`use-cases/`](https://github.com/iterative/dataset-registry/tree/master/use-cases) -directory of the same DVC repository. - -To create a -[first version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) -of our dataset, we extracted the first part into the `use-cases/cats-dogs` -directory (illustrated below), and ran `dvc add use-cases/cats-dogs` to -[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory). 
- -```dvc -$ tree use-cases/cats-dogs --filelimit 3 -use-cases/cats-dogs -└── data - ├── train - │   ├── cats [500 image files] - │   └── dogs [500 image files] - └── validation - ├── cats [400 image files] - └── dogs [400 image files] -``` - -In a local DVC project, we could have downloaded this dataset at this point with -the following command: - -```dvc -$ dvc import git@github.com:iterative/dataset-registry.git \ - use-cases/cats-dogs -``` - -> Note that unlike `dvc get`, which can be used from any directory, `dvc import` -> always needs to run from an [initialized](/doc/command-reference/init) DVC -> project. - -
- -### Expand for actionable command (optional) - -The command above is meant for informational purposes only. If you actually run -it in a DVC project, although it should work, it will import the latest version -of `use-cases/cats-dogs` from `dataset-registry`. The following command would -actually bring in the version in question: - -```dvc -$ dvc import --rev cats-dogs-v1 \ - git@github.com:iterative/dataset-registry.git \ - use-cases/cats-dogs -``` - -See the `dvc import` command reference for more details on the `--rev` -(revision) option. - -
- -Importing keeps the connection between the local project and the source data -registry where we are downloading the dataset from. This is achieved by creating -a particular kind of [DVC-file](/doc/user-guide/dvc-file-format) that uses the -`repo` field (a.k.a. _import stage_). (This file can be used for versioning the -import with Git.) - -> For a sample DVC-file resulting from `dvc import`, refer to -> [this example](/doc/command-reference/import#example-data-registry). - -Back in our **dataset-registry** project, a -[second version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases) -of our dataset was created by extracting the second part, with 1000 additional -images (500 cats, 500 dogs), into the same directory structure. Then, we simply -ran `dvc add use-cases/cats-dogs` again. - -In our local project, all we have to do in order to obtain this latest version -of the dataset is to run: - -```dvc -$ dvc update cats-dogs.dvc -``` - -This is possible because of the connection that the import stage saved among -local and source projects, as explained earlier. - -
- -### Expand for actionable command (optional) - -As with the previous hidden note, actually trying the commands above should -produced the expected results, but not for obvious reasons. Specifically, the -initial `dvc import` command would have already obtained the latest version of -the dataset (as noted before), so this `dvc update` is unnecessary and won't -have an effect. - -If you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its import -stage (DVC-file) would be fixed to that Git tag (`cats-dogs-v1`). In order to -update it, do not use `dvc update`. Instead, re-import the data by using the -original import command (without `--rev`). Refer to -[this example](http://localhost:3000/doc/command-reference/import#example-fixed-revisions-re-importing) -for more information. - -
- -This downloads new and changed files in `cats-dogs/` from the source project, -and updates the metadata in the import stage DVC-file. - -As an extra detail, notice that so far our local project is working only with a -local cache. It has no need to setup a -[remotes](/doc/command-reference/remote) to [pull](/doc/command-reference/pull) -or [push](/doc/command-reference/push) this dataset. diff --git a/static/img/data-registry.png b/static/img/data-registry.png new file mode 100644 index 0000000000..e254b0175b Binary files /dev/null and b/static/img/data-registry.png differ