iterative · shcheklein · Dec 16, 2019 · Nov 20, 2019 · Nov 21, 2019 · Nov 21, 2019
diff --git a/src/Documentation/sidebar.json b/src/Documentation/sidebar.json
@@ -109,7 +109,7 @@
         "slug": "sharing-data-and-model-files"
       },
       "shared-development-server",
-      "data-registry"
+      "data-registries"
     ]
   },
   {

diff --git a/static/docs/command-reference/get.md b/static/docs/command-reference/get.md
@@ -175,7 +175,7 @@ different names, and not currently tracked by Git:
 $ git status
 ...
 Untracked files:
-  (use "git add <file>..." to include in what will be committed)
+  (use "git add <file> ..." to include in what will be committed)
 
 	model.bigrams.pkl
 	model.monograms.pkl

diff --git a/static/docs/command-reference/install.md b/static/docs/command-reference/install.md
@@ -155,7 +155,7 @@ checkout the `6-featurization` tag:
 $ git checkout 6-featurization
 Note: checking out '6-featurization'.
 
-You are in 'detached HEAD' state.  ...
+You are in 'detached HEAD' state...
 
 $ dvc status
 
@@ -216,7 +216,7 @@ We can now repeat the command run earlier, to see the difference.
 $ git checkout 6-featurization
 Note: checking out '6-featurization'.
 
-You are in 'detached HEAD' state. ...
+You are in 'detached HEAD' state...
 
 HEAD is now at d13ba9a add featurization stage
 
@@ -257,8 +257,7 @@ helpfully informs us the workspace is out of sync. We should therefore run the
 
 ```dvc
 $ dvc repro evaluate.dvc
-
-... much output
+...
 To track the changes with git run:
 
     git add featurize.dvc train.dvc evaluate.dvc

diff --git a/static/docs/tutorials/deep/reproducibility.md b/static/docs/tutorials/deep/reproducibility.md
@@ -34,7 +34,7 @@ $ dvc repro model.p.dvc
 $ dvc repro
 ```
 
-Tries to reproduce the same pipeline... But there is still nothing to reproduce.
+Tries to reproduce the same pipeline, but there is still nothing to reproduce.
 
 ## Adding bigrams
 

diff --git a/static/docs/use-cases/data-registries.md b/static/docs/use-cases/data-registries.md
@@ -0,0 +1,195 @@
+# Data Registries
+
+One of the main uses of <abbr>DVC repositories</abbr> is the
+[versioning of data and model files](/doc/use-cases/data-and-model-files-versioning),
+with commands such as `dvc add`. To enable the reuse of these
+<abbr>data artifacts</abbr> between different projects, DVC also provides the
+`dvc import` and `dvc get` commands, among others. This means that a project can
+depend on data from an external <abbr>DVC project</abbr>, **similar to package
+management systems, but for data science projects**.
+
+![](/static/img/data-registry.png) _Data and models as code_
+
+Keeping this in mind, we could build a <abbr>DVC project</abbr> dedicated to
+tracking and versioning _datasets_ (or any large data, even ML models). This way
+we would have a repository with all the metadata and history of changes of
+different datasets. We could see who updated what, and when, and use pull
+requests to update data (the same way we do with code). This is what we call a
+**data registry**, which can work as data management _middleware_ between ML
+projects and cloud storage.
+
+> Note that a single dedicated repository is just one possible pattern to create
+> data registries with DVC.
+
+Advantages of using a DVC **data registry** project:
+
+- Data as code: Improve _lifecycle management_ with versioning of simple
+  directory structures (like Git on cloud storage), without ad-hoc conventions.
+  Leverage Git and Git hosting features such as commits, branching, pull
+  requests, reviews, and even continuous deployment of ML models.
+- Reusability: Reproduce and organize _feature stores_ with a simple CLI
+  (`dvc get` and `dvc import` commands, similar to software package management
+  systems like `pip`).
+- Persistence: The DVC registry-controlled
+  [remote storage](/doc/command-reference/remote) (e.g. an S3 bucket) improves
+  data security. There is less chance that someone can delete or rewrite a
+  model, for example.
+- Storage Optimization: Track data
+  [shared](/doc/use-cases/share-data-and-model-files) by multiple projects
+  centralized in a single location (with the ability to create distributed
+  copies on other remotes). This simplifies data management and optimizes space
+  requirements.
+- Security: Registries can be set up to have read-only remote storage (e.g. an
+  HTTP location). Git versioning of [DVC-files](/doc/user-guide/dvc-file-format)
+  allows us to track and audit data changes.
+
+## Building registries
+
+Data registries can be created like any other <abbr>DVC repository</abbr> with
+`git init` and `dvc init`. A good way to organize them is with different
+directories, to group the data into separate uses, such as `images/`,
+`natural-language/`, etc. For example, our
+[dataset-registry](https://github.com/iterative/dataset-registry) uses a
+directory for each section in our website documentation, like `get-started/`,
+`use-cases/`, etc.
+
+Adding datasets to a registry can be as simple as placing the data file or
+directory in question inside the <abbr>workspace</abbr>, and telling DVC to
+track it, with `dvc add`. For example:
+
+```dvc
+$ mkdir -p music
+$ cp -r ~/Downloads/millionsongsubset_full music/songs
+$ dvc add music/songs
+```
+
+> This example dataset actually exists. See
+> [MillionSongSubset](http://millionsongdataset.com/pages/getting-dataset/#subset).
+
+A regular Git workflow can be followed with the tiny
+[DVC-files](/doc/user-guide/dvc-file-format) that substitute the actual data
+(`music/songs.dvc` in this example). This enables team collaboration on data at
+the same level as with source code (commit history, branching, pull requests,
+reviews, etc.):
+
+```dvc
+$ git add music/songs.dvc music/.gitignore
+$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
+```
+
+> The actual data is stored in the project's <abbr>cache</abbr> and should be
+> [pushed](/doc/command-reference/push) to one or more
+> [remote storage](/doc/command-reference/remote) locations.
+
+## Using registries
+
+The main methods to consume <abbr>data artifacts</abbr> from a **data registry**
+are the `dvc import` and `dvc get` commands, as well as the `dvc.api.open()`
+function (Python).
+
+### Simple download (get)
+
+This is analogous to using direct download tools like
+[`wget`](https://www.gnu.org/software/wget/) (HTTP),
+[`aws s3 cp`](https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html) (S3),
+etc. To get a dataset, for example, we can run something like:
+
+```dvc
+$ dvc get [email protected]:path/to/repository.git \
+          path/to/dataset
+```
+
+This downloads `path/to/dataset` from the <abbr>project</abbr>'s
+[default remote](/doc/command-reference/remote/default) and places it in the
+current working directory (anywhere in the file system with user write access).
+
+> Note that this command (as well as `dvc import`) has a `--rev` option to
+> download specific versions of the data.
+
+### Import workflow
+
+`dvc import` uses the same syntax as `dvc get`:
+
+```dvc
+$ dvc import [email protected]:path/to/repository.git \
+             path/to/dataset
+```
+
+> Note that unlike `dvc get`, which can be used from any directory, `dvc import`
+> needs to run within an [initialized](/doc/command-reference/init) DVC project.
+
+Besides downloading, importing saves the dependency of the local project towards
+the data source (registry repository). This is achieved by creating a particular
+kind of [DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import stage_).
+This file can be staged and committed with Git.
+
+As an addition to the import workflow, and enabled by the saved dependency, we
+can easily bring the imported data up to date in our consumer project with
+`dvc update` whenever the dataset changes in the source project (data registry):
+
+```dvc
+$ dvc update dataset.dvc
+```
+
+`dvc update` downloads new and changed files, or removes deleted ones, from
+`path/to/dataset` based on the latest version of the source project. It also
+updates the project dependency metadata in the import stage (DVC-file).
+
+### Programmatic reusability of DVC data
+
+Our Python API, included with the `dvc` package installed with DVC, includes the
+`open` function to load/stream data directly from remote DVC projects:
+
+```python
+import dvc.api  # `open` is a function in this module, not a submodule
+
+model_path = 'path/to/model'
+repo_url = '[email protected]:path/to/repository.git'
+
+with dvc.api.open(model_path, repo=repo_url) as model:
+    ...  # Make some predictions...
+```
+
+This opens `path/to/model` as a file object. Such a method could be used as a
+code-internal **deployment** method for ML models, for example.
+
+## Updating registries
+
+Datasets evolve, and DVC is prepared to handle it. Just change the data in the
+registry, and apply the updates by running `dvc add` again:
+
+```dvc
+$ cp -r /path/to/1000/image/dir music/songs
+$ dvc add music/songs
+```
+
+DVC then modifies the corresponding DVC-file to reflect the changes in the data,
+and this will be noticed by Git:
+
+```dvc
+$ git status
+Changes not staged for commit:
+...
+	modified:   music/songs.dvc
+```
+
+Iterating on this process for several datasets can give shape to a robust
+registry, which is basically a repository that mainly versions a bunch of
+DVC-files, as you can see in the hypothetical example below.
+
+```dvc
+$ tree --filelimit=100
+.
+├── images
+│   ├── .gitignore
+│   ├── cats-dogs [2800 entries]  # Listed in .gitignore
+│   ├── faces [10000 entries]     # Listed in .gitignore
+│   ├── cats-dogs.dvc
+│   └── faces.dvc
+├── music
+│   ├── .gitignore
+│   ├── songs [11000 entries]     # Listed in .gitignore
+│   └── songs.dvc
+├── text
+...
+```
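
A note on the `dvc.api.open()` snippet in `data-registries.md`: the page says that `dvc get` and `dvc import` accept `--rev` to pick a specific version of the data, and `dvc.api.open()` takes a matching `rev` argument. The following is a minimal sketch, not part of this diff, of a helper that reads a registry file at a given revision. The helper name `read_from_registry` and its `opener` parameter are hypothetical, introduced only so the sketch can be exercised without DVC installed or network access.

```python
def read_from_registry(path, repo_url, rev=None, opener=None):
    """Return the full contents of `path` as tracked in the registry at `repo_url`.

    `rev` is any Git revision (branch, tag, commit); None means the default one.
    `opener` is a hypothetical hook for tests; by default dvc.api.open is used.
    """
    if opener is None:
        # Imported lazily so the function can be defined without DVC installed.
        import dvc.api
        opener = dvc.api.open
    # dvc.api.open returns a context manager yielding a file object.
    with opener(path, repo=repo_url, rev=rev) as fd:
        return fd.read()
```

In a consumer project this might be called as `read_from_registry('path/to/dataset', repo_url, rev='v1.0')`, where `v1.0` is an assumed tag name in the registry repository.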