From c31d9713cbc2ffd2d4903db8a23c902fd18393c7 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Tue, 19 Nov 2019 18:59:06 -0600
Subject: [PATCH 1/8] use-cases: address smaller points from review (#795)

---
 static/docs/use-cases/data-registry.md | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md
index a5eead5b21..937c6e9d72 100644
--- a/static/docs/use-cases/data-registry.md
+++ b/static/docs/use-cases/data-registry.md
@@ -13,13 +13,13 @@ example, project A may use a data file to begin its data
 same file; Instead of
 [adding it](/doc/command-reference/add#example-single-file) it to both projects,
 B can simply import it from A. Furthermore, the version of the data file
-imported to B can be an older iteration than what's currently used in A.
+imported to B can be different than what's currently used in A.
 
 Keeping this in mind, we could build a DVC project dedicated to
 tracking and versioning datasets (or any kind of large files). This way we would
-have a repository that has all the metadata and change history for the project's
-data. We can see who updated what, and when; use pull requests to update data
-the same way you do with code; and we don't need ad-hoc conventions to store
+have a repository with all the metadata and history of changes in the project's
+data. We can see who updated what, and when, use pull requests to update data
+the same way you do with code, and we don't need ad-hoc conventions to store
 different data versions. Other projects can share the data in the registry by
 downloading (`dvc get`) or importing (`dvc import`) them for use in different
 data processes.
@@ -28,9 +28,8 @@ The advantages of using a DVC **data registry** project are:
 
 - Data as code: Improve _lifecycle management_ with versioning of simple
   directory structures (like Git for your cloud storage), without ad-hoc
-  conventions. Leverage Git and Git hosting features such as change history,
-  branching, pull requests, reviews, and even continuous deployment of ML
-  models.
+  conventions. Leverage Git and Git hosting features such as commits, branching,
+  pull requests, reviews, and even continuous deployment of ML models.
 - Reusability: Reproduce and organize _feature stores_ with a simple CLI
   (`dvc get` and `dvc import` commands, similar to software package management
   systems like `pip`).
@@ -49,8 +48,8 @@ The advantages of using a DVC **data registry** project are:
 
 ## Example
 
-A dataset we use for several of our examples and tutorials is one containing
-2800 images of cats and dogs. We partitioned the dataset in two for our
+A dataset we use for several of our examples and tutorials contains 2800 images
+of cats and dogs. We partitioned the dataset in two for our
 [Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on a
 storage server, downloading them with `wget` in our examples. This setup was
 then revised to download the dataset with `dvc get` instead, so we created the
From 6002cba2d1e166cd1b628212382531340db6a396 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Wed, 20 Nov 2019 18:25:05 -0600
Subject: [PATCH 2/8] use-cases: reinforce hypothetical phrasing in data registry intro paragraph per https://github.com/iterative/dvc.org/issues/795#issuecomment-556114361

---
 static/docs/use-cases/data-registry.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md
index 937c6e9d72..eccaeedb15 100644
--- a/static/docs/use-cases/data-registry.md
+++ b/static/docs/use-cases/data-registry.md
@@ -10,7 +10,7 @@ different projects (similar to package management systems, but for data), DVC
 also includes the `dvc get`, `dvc import`, and `dvc update` commands. For
 example, project A may use a data file to begin its data
 [pipeline](/doc/command-reference/pipeline), but project B also requires this
-same file; Instead of
+same file. Instead of
 [adding it](/doc/command-reference/add#example-single-file) it to both projects,
 B can simply import it from A. Furthermore, the version of the data file
 imported to B can be different than what's currently used in A.
@@ -18,13 +18,13 @@ imported to B can be different than what's currently used in A.
 Keeping this in mind, we could build a DVC project dedicated to
 tracking and versioning datasets (or any kind of large files). This way we would
 have a repository with all the metadata and history of changes in the project's
-data. We can see who updated what, and when, use pull requests to update data
-the same way you do with code, and we don't need ad-hoc conventions to store
-different data versions. Other projects can share the data in the registry by
-downloading (`dvc get`) or importing (`dvc import`) them for use in different
-data processes.
+data. We could see who updated what, and when, use pull requests to update data
+(the same way we do with code), and avoid ad-hoc conventions to store different
+data versions. This is what we call a data registry. Other projects can share
+datasets in a registry by downloading (`dvc get`) or importing (`dvc import`)
+them for use in different data processes.
 
-The advantages of using a DVC **data registry** project are:
+Advantages of using a DVC **data registry** project:
 
 - Data as code: Improve _lifecycle management_ with versioning of simple
   directory structures (like Git for your cloud storage), without ad-hoc
From 47ebae5868f88b11b6fda55b70a7b6df48b6c9d9 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Wed, 20 Nov 2019 18:50:45 -0600
Subject: [PATCH 3/8] use-cases: partitioned->split in data registry case per #795 and https://github.com/iterative/dvc.org/issues/795#issuecomment-556114361

---
 static/docs/use-cases/data-registry.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md
index eccaeedb15..adcc0a7990 100644
--- a/static/docs/use-cases/data-registry.md
+++ b/static/docs/use-cases/data-registry.md
@@ -49,7 +49,7 @@ Advantages of using a DVC **data registry** project:
 ## Example
 
 A dataset we use for several of our examples and tutorials contains 2800 images
-of cats and dogs. We partitioned the dataset in two for our
+of cats and dogs. We split the dataset in two for our
 [Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on a
 storage server, downloading them with `wget` in our examples. This setup was
 then revised to download the dataset with `dvc get` instead, so we created the
From a578c15d58384a25ac85fb9e1fa6c5b6f163e521 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Wed, 20 Nov 2019 18:56:50 -0600
Subject: [PATCH 4/8] use-cases: geatly simplify mention about project inter-dependency in data reg per https://github.com/iterative/dvc.org/issues/795#issuecomment-556114361 and https://github.com/iterative/dvc.org/issues/795#issuecomment-556651871

---
 static/docs/use-cases/data-registry.md | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md
index adcc0a7990..45fc308360 100644
--- a/static/docs/use-cases/data-registry.md
+++ b/static/docs/use-cases/data-registry.md
@@ -7,13 +7,9 @@ tracking of datasets and any other data artifacts.
 
 With the aim to enable reusability of these versioned artifacts between
 different projects (similar to package management systems, but for data), DVC
-also includes the `dvc get`, `dvc import`, and `dvc update` commands. For
-example, project A may use a data file to begin its data
-[pipeline](/doc/command-reference/pipeline), but project B also requires this
-same file. Instead of
-[adding it](/doc/command-reference/add#example-single-file) it to both projects,
-B can simply import it from A. Furthermore, the version of the data file
-imported to B can be different than what's currently used in A.
+also includes the `dvc get`, `dvc import`, and `dvc update` commands. This means
+that a project can depend on data from an external DVC project, but
+chaining several projects this way can easily become messy...
 
 Keeping this in mind, we could build a DVC project dedicated to
 tracking and versioning datasets (or any kind of large files). This way we would
From d9ad1ab2fb60e26fb2fdf6f51f5a6040b335cc2f Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Thu, 21 Nov 2019 19:00:11 -0600
Subject: [PATCH 5/8] use-cases: improve intro to example in data registry case

---
 static/docs/use-cases/data-registry.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md
index 45fc308360..cb8a07f0f3 100644
--- a/static/docs/use-cases/data-registry.md
+++ b/static/docs/use-cases/data-registry.md
@@ -44,17 +44,17 @@ Advantages of using a DVC **data registry** project:
 
 ## Example
 
-A dataset we use for several of our examples and tutorials contains 2800 images
-of cats and dogs. We split the dataset in two for our
-[Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on a
-storage server, downloading them with `wget` in our examples. This setup was
-then revised to download the dataset with `dvc get` instead, so we created the
+A dataset we commonly use for several of our examples and tutorials contains
+2800 images of cats and dogs. We split it in two for our
+[Versioning Tutorial](/doc/tutorials/versioning). Originally, the parts were
+backed up on a storage server, and downloaded with `wget`. This setup was then
+revised to download the dataset sing `dvc get` instead, so we created the
 [dataset-registry](https://github.com/iterative/dataset-registry)) repository, a
 DVC project hosted on GitHub, to version the dataset (see its
 [`tutorial/ver`](https://github.com/iterative/dataset-registry/tree/master/tutorial/ver)
 directory).
 
-However, there are a few problems with the way this dataset is structured. Most
+However, there are a few problems with the way that dataset is structured. Most
 importantly, this single dataset is tracked by 2 different
 [DVC-files](/doc/user-guide/dvc-file-format), instead of 2 versions of the same
 one, which would better reflect the intentions of this dataset... Fortunately,
From 50b772ea806d078e974b7144bc87419db0a498e1 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sat, 23 Nov 2019 00:09:24 -0600
Subject: [PATCH 6/8] use-cases: rephrase much of the data registry example to improve its logic and readability per https://github.com/iterative/dvc.org/issues/795#issuecomment-557228299

---
 static/docs/use-cases/data-registry.md | 101 +++++++++++++------------
 1 file changed, 52 insertions(+), 49 deletions(-)

diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md
index cb8a07f0f3..def518eb38 100644
--- a/static/docs/use-cases/data-registry.md
+++ b/static/docs/use-cases/data-registry.md
@@ -45,28 +45,29 @@ Advantages of using a DVC **data registry** project:
 ## Example
 
 A dataset we commonly use for several of our examples and tutorials contains
-2800 images of cats and dogs. We split it in two for our
+2800 images of cats and dogs, which was split it in two for our
 [Versioning Tutorial](/doc/tutorials/versioning). Originally, the parts were
-backed up on a storage server, and downloaded with `wget`. This setup was then
-revised to download the dataset sing `dvc get` instead, so we created the
-[dataset-registry](https://github.com/iterative/dataset-registry)) repository, a
-DVC project hosted on GitHub, to version the dataset (see its
+backed up on a storage server, and downloaded with
+[`wget`](https://www.gnu.org/software/wget/). This was then revised in order to
+download the parts with `dvc get` instead, so we created the
+[dataset-registry](https://github.com/iterative/dataset-registry)
+project to version the dataset (in the
 [`tutorial/ver`](https://github.com/iterative/dataset-registry/tree/master/tutorial/ver)
 directory).
 
-However, there are a few problems with the way that dataset is structured. Most
-importantly, this single dataset is tracked by 2 different
-[DVC-files](/doc/user-guide/dvc-file-format), instead of 2 versions of the same
-one, which would better reflect the intentions of this dataset... Fortunately,
-we have also prepared an improved alternative in the
+However, there's a few problems with the way that dataset is versioned. Most
+importantly, this split dataset is tracked by 2 different
+[DVC-files](/doc/user-guide/dvc-file-format) (one for each part), instead of 2
+versions of a single DVC-file. An initial version could have the first part
+only, while an update would have the entire, unified dataset. Fortunately, we
+have also prepared this improved alternative in the
 [`use-cases/`](https://github.com/iterative/dataset-registry/tree/master/use-cases)
 directory of the same DVC repository.
 
-To create a
-[first version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases)
+To create the
+[initial version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases)
 of our dataset, we extracted the first part into the `use-cases/cats-dogs`
-directory (illustrated below), and ran `dvc add use-cases/cats-dogs` to
-[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory).
+directory, illustrated below:
 
 ```dvc
 $ tree use-cases/cats-dogs --filelimit 3
@@ -80,7 +81,10 @@ use-cases/cats-dogs
 └── dogs [400 image files]
 ```
 
-In a local DVC project, we could have obtained this dataset at this point with
+Then we ran `dvc add use-cases/cats-dogs` to
+[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory).
+ +At this point, we could have obtained this dataset in another DVC project with the following command: ```dvc @@ -90,15 +94,16 @@ $ dvc import git@github.com:iterative/dataset-registry.git \ > Note that unlike `dvc get`, which can be used from any directory, `dvc import` > always needs to run from an [initialized](/doc/command-reference/init) DVC -> project. +> project. Remember also that with both commands, the data comes from the source +> project's remote storage, not from the Git repository itself.
### Expand for actionable command (optional) The command above is meant for informational purposes only. If you actually run -it in a DVC project, although it should work, it will import the latest version -of `use-cases/cats-dogs` from `dataset-registry`. The following command would +it, although it will work, it will import the latest version of +`use-cases/cats-dogs` from `dataset-registry`. The following command would actually bring in the version in question: ```dvc @@ -112,54 +117,52 @@ See the `dvc import` command reference for more details on the `--rev`
-Importing keeps the connection between the local project and the source data
-registry where we are downloading the dataset from. This is achieved by creating
-a particular kind of [DVC-file](/doc/user-guide/dvc-file-format) that uses the
-`repo` field (a.k.a. _import stage_). (This file can be used for versioning the
-import with Git.)
+Importing keeps the connection between the local project and the
+data source (registry repository). This is achieved by creating a
+particular kind of [DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import
+stage_) that includes a `repo` field. (This file can be used staged and
+committed with Git.)
 
 > For a sample DVC-file resulting from `dvc import`, refer to
 > [this example](/doc/command-reference/import#example-data-registry).
 
-Back in our **dataset-registry** project, a
+Back in our **dataset-registry** project, the
 [second version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases)
 of our dataset was created by extracting the second part, with 1000 additional
-images (500 cats, 500 dogs), into the same directory structure. Then, we simply
-ran `dvc add use-cases/cats-dogs` again.
+images (500 cats, 500 dogs) on top of the existing directory structure. Then, we
+simply ran `dvc add use-cases/cats-dogs` again.
 
-In our local project, all we have to do in order to obtain this latest version
-of the dataset is to run:
+All we would have to do in order to obtain this latest version in another
+project where the first version was previously imported, is to run:
 
 ```dvc
 $ dvc update cats-dogs.dvc
 ```
 
-This is possible because of the connection that the import stage saved among
-local and source projects, as explained earlier.
-
 ### Expand for actionable command (optional)
 
-As with the previous hidden note, actually trying the commands above should
-produced the expected results, but not for obvious reasons. Specifically, the
-initial `dvc import` command would have already obtained the latest version of
-the dataset (as noted before), so this `dvc update` is unnecessary and won't
-have an effect.
+As with the previous hidden note, actually trying the command above will produce
+the desired results, but not for obvious reasons. The initial `dvc import`
+command would have already obtained the latest version of the dataset (as noted
+before), so this `dvc update` is unnecessary and won't have any effect.
 
-If you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its import
-stage (DVC-file) would be fixed to that Git tag (`cats-dogs-v1`). In order to
-update it, do not use `dvc update`. Instead, re-import the data by using the
-original import command (without `--rev`). Refer to
-[this example](http://localhost:3000/doc/command-reference/import#example-fixed-revisions-re-importing)
-for more information.
+And if you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its
+import stage (DVC-file) would be
+[fixed to that revision](/doc/command-reference/import#example-fixed-revisions-re-importing)
+(`cats-dogs-v1` tag), so `dvc update` would also be ineffective. In order to
+actually "update" it, re-import the data instead, by now running the initial
+import command (the one without `--rev`):
 
-
+```dvc
+$ dvc import git@github.com:iterative/dataset-registry.git \
+  use-cases/cats-dogs
+```
 
-This downloads new and changed files in `cats-dogs/` from the source project,
-and updates the metadata in the import stage DVC-file.
+
 
-As an extra detail, notice that so far our local project is working only with a
-local cache. It has no need to setup a
-[remotes](/doc/command-reference/remote) to [pull](/doc/command-reference/pull)
-or [push](/doc/command-reference/push) this dataset.
+This is possible because of the connection that the import stage saved among
+local and source projects, as explained earlier. The update downloads new and
+changed files in `cats-dogs/` based on the source project, and updates the
+metadata in the import stage DVC-file.
From 55ab757106eb8a19fe25317488fb3bbfcc97b4b9 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 24 Nov 2019 17:45:41 -0600
Subject: [PATCH 7/8] review usage of ellipses thoughout docs per https://github.com/iterative/dvc.org/pull/805#discussion_r349956273

---
 static/docs/command-reference/get.md          | 2 +-
 static/docs/command-reference/install.md      | 7 +++----
 static/docs/tutorials/deep/reproducibility.md | 2 +-
 static/docs/use-cases/data-registry.md        | 2 +-
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/static/docs/command-reference/get.md b/static/docs/command-reference/get.md
index 120b3c98a3..f1cbc6c6e2 100644
--- a/static/docs/command-reference/get.md
+++ b/static/docs/command-reference/get.md
@@ -163,7 +163,7 @@ different names, and not currently tracked by Git:
 $ git status
 ...
 Untracked files:
- (use "git add ..." to include in what will be committed)
+ (use "git add ..." to include in what will be committed)
 
 model.bigrams.pkl
 model.monograms.pkl
diff --git a/static/docs/command-reference/install.md b/static/docs/command-reference/install.md
index cda7101d8b..ff2c9710a2 100644
--- a/static/docs/command-reference/install.md
+++ b/static/docs/command-reference/install.md
@@ -155,7 +155,7 @@ checkout the `6-featurization` tag:
 $ git checkout 6-featurization
 Note: checking out '6-featurization'.
 
-You are in 'detached HEAD' state. ...
+You are in 'detached HEAD' state...
 
 $ dvc status
 
@@ -216,7 +216,7 @@ We can now repeat the command run earlier, to see the difference.
 $ git checkout 6-featurization
 Note: checking out '6-featurization'.
 
-You are in 'detached HEAD' state. ...
+You are in 'detached HEAD' state...
 
 HEAD is now at d13ba9a add featurization stage
 
@@ -257,8 +257,7 @@ helpfully informs us the workspace is out of sync. We should therefore run the
 
 ```dvc
 $ dvc repro evaluate.dvc
-
-... much output
+...
 To track the changes with git run:
 
 git add featurize.dvc train.dvc evaluate.dvc
diff --git a/static/docs/tutorials/deep/reproducibility.md b/static/docs/tutorials/deep/reproducibility.md
index 1e3ad9fcb3..25d1e7024f 100644
--- a/static/docs/tutorials/deep/reproducibility.md
+++ b/static/docs/tutorials/deep/reproducibility.md
@@ -34,7 +34,7 @@ $ dvc repro model.p.dvc
 $ dvc repro
 ```
 
-Tries to reproduce the same pipeline... But there is still nothing to reproduce.
+Tries to reproduce the same pipeline, but there is still nothing to reproduce.
 
 ## Adding bigrams
 
diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md
index def518eb38..52269b8745 100644
--- a/static/docs/use-cases/data-registry.md
+++ b/static/docs/use-cases/data-registry.md
@@ -9,7 +9,7 @@ With the aim to enable reusability of these versioned artifacts between
 different projects (similar to package management systems, but for data), DVC
 also includes the `dvc get`, `dvc import`, and `dvc update` commands. This means
 that a project can depend on data from an external DVC project, but
-chaining several projects this way can easily become messy...
+chaining several projects this way can easily become messy.
 
 Keeping this in mind, we could build a DVC project dedicated to
 tracking and versioning datasets (or any kind of large files). This way we would
From d125437dcfe5e7ac9a6b7665a6f5423d418bba7d Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 24 Nov 2019 20:02:44 -0600
Subject: [PATCH 8/8] use-cases: remove remark about imports getting messy per https://github.com/iterative/dvc.org/issues/795#issuecomment-557943717 (and https://github.com/iterative/dvc.org/pull/805#pullrequestreview-321998559)

---
 static/docs/use-cases/data-registry.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/static/docs/use-cases/data-registry.md b/static/docs/use-cases/data-registry.md
index 52269b8745..b03433b9dc 100644
--- a/static/docs/use-cases/data-registry.md
+++ b/static/docs/use-cases/data-registry.md
@@ -8,8 +8,7 @@ tracking of datasets and any other data artifacts.
 With the aim to enable reusability of these versioned artifacts between
 different projects (similar to package management systems, but for data), DVC
 also includes the `dvc get`, `dvc import`, and `dvc update` commands. This means
-that a project can depend on data from an external DVC project, but
-chaining several projects this way can easily become messy.
+that a project can depend on data from an external DVC project.
 
 Keeping this in mind, we could build a DVC project dedicated to
 tracking and versioning datasets (or any kind of large files). This way we would
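
Reviewer sketch (not part of the series): the consumer-side workflow that the patched `data-registry.md` describes can be summarized as a dry run. The registry URL, dataset path, `cats-dogs-v1` tag and `cats-dogs.dvc` file name are taken from the patches above; the `show` and `registry_workflow` helpers are hypothetical names introduced only for illustration, and the script merely prints the DVC commands rather than running them. The remark about `dvc update` being ineffective on a pinned import reflects the wording of these 2019 docs and may not hold for other DVC versions.

```shell
#!/bin/sh
# Dry-run sketch: prints the DVC commands from the patched docs, runs nothing.

REGISTRY="git@github.com:iterative/dataset-registry.git"  # URL quoted in the patch
DATASET="use-cases/cats-dogs"                              # path quoted in the patch
PINNED_REV="cats-dogs-v1"                                  # tag quoted in the patch

# Hypothetical helper (not part of DVC): print a command instead of executing it.
show() {
  echo "+ $*"
}

# Hypothetical wrapper listing the steps in the order the docs discuss them.
registry_workflow() {
  # 1. Import the dataset pinned to its first version.
  show dvc import --rev "$PINNED_REV" "$REGISTRY" "$DATASET"
  # 2. Per the patched text, a pinned import is not moved by `dvc update`,
  #    so the docs suggest re-importing without --rev to get the latest data.
  show dvc import "$REGISTRY" "$DATASET"
  # 3. For an import made without --rev, later registry changes are
  #    picked up through the import stage file.
  show dvc update cats-dogs.dvc
}

registry_workflow
```

To actually execute the steps, the `show` prefix would be dropped and the commands run inside an initialized DVC project, as the note in patch 6 requires for `dvc import`.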